PREFACE


As  a  cost  control  measure,  the  federal  government  established  a  prospective 
payment  system  (PPS)  in  1983  to  reimburse  hospitals  for  inhospital  care  of  Medicare 
patients.  PPS  changed  the  way  Medicare  reimbursed  hospitals  from  a  cost  or  charge 
basis  to  a  prospectively  determined,  fixed-price  system  in  which  hospitals  are  paid 
according  to  the  diagnosis  related  group  (DRG)  into  which  a  patient  is  classified.  The 
fixed  payment  per  patient  provides  financial  incentives  for  hospitals  to  reduce  both  length 
of  stay  and  intensity  of  care,  giving  rise  to  concerns  that  the  quality  of  care  may  have 
declined  under  PPS. 

In  October  1985,  The  RAND  Corporation  began  a  study,  funded  by  the  Health 
Care  Financing  Administration  (HCFA),  to  evaluate  the  impact  of  the  DRG-based  PPS 
system.  The  project  was  undertaken  in  collaboration  with  the  Professional  Review 
Organizations  (PROs)  in  five  states.  The  objective  of  this  four-year  study  was  to 
determine  whether  PPS  cost-containment  efforts  have  affected  the  quality  of  care 
received  by  hospitalized  Medicare  patients. 

Six  diseases  or  conditions  from  among  the  most  common  and  most  serious 
admitting  diagnoses  of  elderly  Medicare  patients  were  selected  for  study:  congestive 
heart  failure,  acute  myocardial  infarction,  hip  fracture,  pneumonia,  cerebrovascular 
accident,  and  depression.  Well-defined  diagnostic  criteria  exist  for  these  diseases,  and 
their  treatment  was  fairly  stable  from  1981  to  1986.  In  addition,  for  these  diseases,  the 
patient's  medical  record  contains  the  information  necessary  to  assess  patient  sickness  at 
the  time  of  hospital  admission  and  the  quality  of  hospital  care.  Hip  fracture  and 
depression  were  chosen  to  provide  data  about  care  for  surgical  and  psychiatric 
conditions,  to  supplement  the  information  provided  by  the  four  medical  conditions. 

Five  states  were  selected  for  study,  each  from  a  different  geographic  region  of  the 
nation:  California,  Florida,  Indiana,  Pennsylvania,  and  Texas.  Each  state  was  divided 
into  10  to  12  study  areas,  and  data  were  gathered  from  a  sample  of  four  to  eight  areas  in 
each  state,  for  a  total  of  30  areas. 

To  evaluate  the  impact  of  PPS,  a  time  series  design  was  used  to  compare  patients 
hospitalized  before  PPS  implementation  (January  1981  to  December  1982)  with  patients 
hospitalized  in  the  same  institutions  following  PPS  implementation  (July  1985  to  June 
1986).  Twenty  percent  of  the  total  patient  sample  was  collected  in  1981,  30  percent  in
1982,  and  50  percent  in  the  post-PPS  time  period. 

The  sample  is  representative  of  patients  nationally  with  respect  to  the  percentage 
of  hospitals  in  urban  areas,  their  size,  the  number  of  house  staff  per  hospital  bed,  and  the 
percentage  of  Medicare  admissions.  City/county  hospitals  and  hospitals  with  many 
Medicaid  patients  were  oversampled  to  better  estimate  the  effect  of  PPS  on  poor  patients. 

Of  the  305  hospitals  asked  to  participate,  297  actually  did  so.  The  planned  study 
sample  was  17,000  patients.  To  collect  data  on  this  number  of  patients,  21,925  medical 
records  were  selected  for  review.  During  medical  record  review,  5,167  patients  did  not 
meet  clinical  criteria  for  having  the  selected  diseases  (even  though  the  record  was  coded 
as  such),  and  these  were  excluded  from  the  study.  Hospitals  were  unable  to  locate  870 
records.  The  final  study  sample  consisted  of  16,758  patients. 

Both  explicit  and  implicit  measures  were  used  to  assess  quality  of  care.  Explicit 
measures  use  preset  criteria  to  evaluate  each  patient's  care.  Implicit  measures  use 
physician  judgment  to  assign  a  quality-of-care  rating.  A  10  percent  subsample  of  the  entire
study  sample  of  medical  records  underwent  implicit  review.  The  sample  of  medical 
records  receiving  implicit  review  was  evaluated  by  physician  specialists  from  each  of  the 
five  study  states. 

Disease-specific  abstraction  forms  were  developed  to  collect  explicit  data  from  the 
patient's  medical  record  about  sickness  at  admission,  process  of  care,  inhospital 
outcomes,  and  patient  status  at  discharge.  These  measures  have  been  reported 
elsewhere. 

In  developing  disease-specific  instruments  for  implicitly  measuring  quality  of 
care,  first  the  literature  on  quality  of  care  for  patients  hospitalized  with  each  disease  and 
the  literature  on  how  to  do  implicit  review  were  reviewed.  Then,  PRO-selected 
physicians  from  the  five  study  states  were  invited  to  participate  in  six  disease-specific 
panel  meetings  to  discuss  and  test  implicit  measures  of  quality  of  care.  Physicians  were 
asked  to  detail  the  problems  they  had  previously  experienced  in  implicitly  reviewing 
records  for  purposes  of  inhospital  assessment  (e.g.,  morbidity  and  mortality  conferences) 
or  PRO  review.  In  addition,  advice  was  sought  about  the  way  medical  records  might 
differ  by  type  of  hospital  or  geographic  area  so  that  the  implicit  review  form  for 
measuring  quality  of  care  could  be  developed  for  use  in  all  kinds  of  acute  care  general 
hospitals  in  the  United  States.  For  details  of  these  implicit  review  measures,  see:
Kahn,  Katherine  L.,  Lisa  V.  Rubenstein,  Marjorie  J.  Sherwood,  and
Robert  H.  Brook,  Structured  Implicit  Review  for  Physician 
Measurement  of  Quality  of  Care:  Development  of  the  Form 
and  Guidelines  for  Its  Use,  The  RAND  Corporation, 
N-3016-HCFA,  1989 


This  RAND  Note  reviews  the  principles  of  the  structured  implicit  review  method 
and  provides  analytic  results  of  studies  using  the  method. 
SUMMARY


To  measure  the  quality  of  medical  care  before  and  after  implementation  of  the 
prospective  payment  system  (PPS)  by  Medicare,  we  developed  methods  for  both  implicit 
and  explicit  review  of  the  medical  record.  Explicit  review  relies  upon  measurement  of 
the  degree  to  which  care  accords  with  previously  specified  criteria.  To  perform  implicit 
review,  reviewers  use  their  own  unspecified  criteria  to  judge  care.  Our  method  of 
performing  implicit  review  was  structured,  requiring  the  reviewing  physicians  to  analyze 
specified  parts  of  the  medical  record  and  to  assess  their  quality  using  a  rating  scale.  Each 
implicit  judgment  was  thus  recorded  as  an  item  on  the  structured  implicit  review  form. 
Items  could  then  be  combined  into  larger  scales.  We  applied  our  structured  implicit 
review  (SIR)  to  a  sample  of  1,366  Medicare  patients  with  congestive  heart  failure,  acute 
myocardial  infarction,  pneumonia,  cerebrovascular  accident,  or  hip  fracture  who  were 
hospitalized  in  1981-1982  or  1985-1986.  The  implicit  review  sample  was  a  random 
subset  of  the  14,012  Medicare  patients  included  in  the  PPS  study.  The  demographic 
characteristics  of  patients  in  the  implicit  review  sample  were  similar  to  those  of  the  larger 
sample  from  which  it  was  drawn. 

To  determine  the  performance  characteristics  of  structured  implicit  review,  we 
evaluated  interitem  and  interrater  reliability.  Interitem  reliabilities  for  scales  were 
between  0.8  and  0.9.  A  randomly  selected  sample  of  about  25  percent  of  all  records  was 
reviewed  by  two  reviewers,  and  a  3  percent  sample  was  reviewed  by  five  reviewers.  We 
found  a  significant  reviewer  effect,  with  some  reviewers  judging  care  more  harshly  and 
some  more  leniently.  After  adjusting  for  the  reviewer  effect,  interrater  reliabilities  for 
scales  were  between  0.4  and  0.7  and  thus  were  adequate  for  comparing  groups  of  people. 
At  this  level  of  reliability,  however,  using  the  Spearman-Brown  Prophecy  Formula,
records  would  have  to  be  reviewed  by  four  to  seven  reviewers  to  be  95  percent  certain 
that  the  care  delivered  was  poor  or  very  poor.  In  contrast,  only  one  reviewer  would  be 
needed  to  be  sure  that  care  was  either  good  or  very  good.  Despite  the  limitations  in 
reliability  of  the  structured  implicit  review  method,  we  found  that  very  poor  quality  of 
care  was  associated  with  increased  death  rates  30  days  after  admission  (17  percent  with 
very  good  care  died  compared  with  30  percent  with  very  poor  care). 
When  we  evaluated  the  quality  of  care  before  and  after  PPS,  we  found  that  the
quality  of  medical  care  improved  between  1981-1982  and  1985-1986.  Twenty-five 
percent  of  patients  received  poor  or  very  poor  care  during  the  earlier  period,  whereas 
only  12  percent  did  in  the  later  period.  More  patients,  however,  were  judged  to  have 
been  discharged  too  soon  and  in  unstable  condition  during  the  later  period  (7  percent 
compared  with  4  percent).  Thus,  except  for  discharge  planning  processes,  the  quality  of 
hospital  care  has  continued  to  improve  for  Medicare  patients. 
1.  INTRODUCTION


The  early  entry  of  the  medical  profession  into  quality  review  strongly  influenced 
the  subsequent  evolution  of  American  medicine  (Codman,  n.d.).  Quality  review 
continues  to  be  a  major  feature  of  medical  practice  today.  The  two  most  common  quality 
assessment  and  assurance  methods  in  current  use  are  the  measurement  of  specific 
structural  characteristics  of  health  care  institutions,  such  as  infection  control  procedures, 
and  the  measurement  of  the  process  of  medical  care  using  implicit  review  of  the  medical 
record.  Implicit  review  depends  upon  a  practitioner's  expert  opinion  to  develop  a  quality 
rating;  explicit  review,  in  contrast,  uses  preset  criteria  to  judge  care.  Despite  the  active
role  of  implicit  review  in  medical  care,  its  measurement  characteristics  have  not  been
extensively  studied.

In  addition  to  its  importance  as  a  common  method  of  quality  review,  implicit 
review  of  the  medical  record  has  the  added  significance  of  being  the  current  community 
gold  standard  for  performing  peer  review.  For  example,  implicit  review  is  used  as  the 
basis  for  any  final  actions  taken  based  on  quality  reviews,  such  as  denial  of  payment  or 
censure  of  a  health  care  provider.  Although  either  implicit  or  explicit  review  can  be 
performed  by  observing  physicians  and  patients,  for  practical  reasons  the  medical  record 
has  been  the  most  common  data  source  for  quality  assessment  (Petersen  and  Andrews, 
1956).  Thus,  when  Professional  Review  Organizations  (PROs),  hospital  quality 
assurance  committees,  and  insurance  companies  perform  quality  reviews,  physicians 
using  implicit  peer  review  of  the  medical  record  are  always  the  final  judges  of  care, 
although  medical  records  for  review  are  often  selected  by  nurse  reviewers  using  explicit 
screening  criteria. 

PURPOSE 

The  purpose  of  this  study  was  to  improve  implicit  review,  to  determine  the 
reliability  and  validity  of  the  improved  method,  and  then  to  use  it  to  evaluate  changes  in 
the  quality  of  medical  care  for  Medicare  patients  in  the  United  States  between  1981  and 
1986.  We  report  as  well  upon  the  relationship  between  judgments  of  quality  based  on 
implicit  and  explicit  review.  Our  goals  in  improving  implicit  review  were  to  develop
training  and  review  protocols  that  were  specific  enough  to  ensure  that  physicians  were
basing  their  judgments  on  the  same  assumptions  and  information.  We  wanted 
disagreements  about  care  to  reflect  true  differences  of  opinion  among  reviewers  rather 
than  differences  in  the  way  the  record  was  reviewed  or  in  how  different  components  of 
quality  were  weighted  in  arriving  at  a  final  judgment.  We  have  called  our  method 
Structured  Implicit  Review  (SIR)  (Rubenstein  et  al.,  1990). 

BACKGROUND 

Researchers  in  the  late  1960s  and  early  1970s  developed  considerable  experience 
with  inpatient  implicit  medical  record  review.  Using  physicians  with  special  experience 
in  the  clinical  condition  under  study,  early  researchers  found  90  percent  agreement 
among  reviewers,  with  an  additional  7  to  8  percent  of  disagreements  resolved  after 
interreviewer  consultation  (Morehead  et  al.,  1962;  Rosenfeld,  1957).  Subsequent  studies
were  less  optimistic.  After  review  of  1,258  obstetrical  records  by  three  independent 
reviewers,  Richardson  (1972)  found  that  one-third  of  the  time  one  reviewer  disagreed 
with  the  other  two  about  the  adequacy  of  the  care  delivered.  To  facilitate  agreement 
among  more  diverse  groups  of  reviewers,  and  to  enable  hospitals  to  review  and  educate 
their  physicians,  Fine  and  Morehead  (1971),  under  the  auspices  of  the  Medical  Society  of 
the  State  of  New  York,  developed  a  series  of  implicit  review  forms  for  reviewers  to  fill 
in,  termed  "review  patterns."  Patterns  were  developed  for  acute  appendicitis,  fibroid 
uterus,  prostatic  hypertrophy,  acute  myocardial  infarction,  diabetes  mellitus,  and  cerebral 
vascular  accident.  These  review  patterns  seemed  to  facilitate  agreement  among 
reviewers,  although  formal  analysis  of  agreement  was  not  done. 

Richardson  (1972)  performed  a  more  rigorous  methodological  study  of  implicit 
review.  Richardson's  review  included  some  frankly  explicit  criteria,  as  well  as  an 
implicit  Overall  Quality  judgment.  In  his  study,  81  university-based  physician  specialists 
(obstetricians,  pediatricians,  or  surgeons)  were  required  to  judge  care  for  patients  with 
toxemia  of  pregnancy,  placenta  previa,  and  abruptio  placentae;  live  neonates  born  to 
these  patients;  gallbladder  or  biliary  tract  operations;  and  infantile  diarrhea.  Twenty-one 
obstetrical  charts,  10  surgical  charts,  and  10  pediatric  charts  were  scored  by  all  reviewers 
in  the  relevant  specialty.  Richardson  found  that  some  reviewers  were  more  likely  than 
others  to  judge  cases  as  unsatisfactory;  that  different  aspects  of  care  affected  the  final 
judgments  of  individual  reviewers;  and  that  overall  agreement  among  reviewers, 
subtracting  that  part  of  the  agreement  that  was  due  to  chance  (Kappa  interrater
reliability),  was  0.398  for  obstetrical  cases,  0.457  for  surgical  cases,  and  0.578  for 
pediatric  cases.  Using  these  results,  he  calculated,  using  the  Spearman-Brown  prediction 
equation,  that  between  16  and  28  judges  would  be  necessary  to  judge  an  individual  case 
with  a  reliability  of  0.95,  the  level  considered  acceptable  to  clinicians  whose  cases  were 
the  subject  of  review.  Peters  (1972)  reanalyzed  Richardson's  data  and  concluded  that 
although  cases  about  which  reviewers  disagreed  could  not  be  easily  judged,  unanimously 
unsatisfactory  cases  could  be  successfully  identified,  without  erroneously  including  a 
large  number  of  satisfactory  cases. 
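
The Spearman-Brown calculation referred to above can be sketched in a few lines. This is the standard textbook form of the formula, which may not be the exact variant Richardson applied; the function names are illustrative and do not come from the studies cited.

```python
import math


def spearman_brown(single_rater_reliability: float, n_raters: float) -> float:
    """Reliability of the pooled judgment of n_raters (textbook Spearman-Brown)."""
    r = single_rater_reliability
    return n_raters * r / (1.0 + (n_raters - 1.0) * r)


def raters_needed(single_rater_reliability: float, target_reliability: float) -> float:
    """Number of raters (possibly fractional) needed to reach the target reliability."""
    r, target = single_rater_reliability, target_reliability
    if not (0.0 < r < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    # Spearman-Brown solved for the number of raters.
    return target * (1.0 - r) / (r * (1.0 - target))


if __name__ == "__main__":
    # Kappas quoted in the text for obstetrical, surgical, and pediatric cases.
    for kappa in (0.398, 0.457, 0.578):
        k = raters_needed(kappa, 0.95)
        print(f"kappa={kappa:.3f}: about {k:.1f} raters for reliability 0.95")
```

Rounded up, this textbook form gives roughly 29, 23, and 14 raters for the three kappas. That is of the same order as the 16 to 28 judges cited above but not identical; the details of Richardson's own calculation are not given here, so the difference should not be read as a correction.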

Brook  (1973)  published  a  detailed  analysis  of  the  quality  of  care  given  to  296 
outpatients  with  urinary  tract  infection,  hypertension,  or  gastric  or  duodenal  ulcer.  He 
reviewed  detailed  case  abstracts.  Ten  internists  reviewed  one-third  of  the  296  case 
abstracts;  each  abstract  was  reviewed  by  three  reviewers.  Cases  were  supplemented  by 
information  about  the  patient's  medical  outcome  at  five  months.  Cases  were  judged  in 
terms  of  adequacy  of  the  medical  care  process,  improvability  or  unimprovability  of  the
outcome,  and  acceptability  of  the  medical  care  overall.  He  found  that  all  three 
physicians  agreed  that  care  was  acceptable  in  11  percent  of  cases  and  unacceptable  in  43 
percent.  Physician  pair  agreements  ranged  from  53  to  95  percent  for  process  of  care,  53 
to  100  percent  for  outcome,  and  37  to  94  percent  for  quality  of  care.  Kappas  on 
physician  pairs,  calculated  subsequently  by  Koran  (1975a),  averaged  0.29. 

The  interrater  reliabilities  reported  in  these  studies  appear  lower  than  we  might 
hope,  but  these  reliabilities  are  not  dramatically  different  from  those  obtained  by 
comparing  various  features  of  the  actual  clinical  exam,  or  interpreting  laboratory  tests 
such  as  cardiograms.  Koran  (1975b)  systematically  reviewed  levels  of  agreement  about 
these  tests  as  reported  in  a  wide  range  of  studies.  Agreement  about  the  presence  or 
absence  of  the  dorsalis  pedis  pulse  was  39  percent  (Meade  et  al.,  1968;  Light,  1971); 
about  presence  or  absence  of  varices  on  esophagoscopy,  46  percent  (Conn  et  al.,  1965); 
and  whether  an  electrocardiogram  (EKG)  was  normal  or  abnormal,  84  percent  (Acheson, 
1960).  Kappas  for  these  levels  of  agreement  would  be  lower  still. 
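
To see why kappa falls below raw percent agreement, the following minimal sketch computes Cohen's kappa for two raters. The ratings are invented for illustration and are not data from any study cited here; the function name is likewise hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters on the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance if the raters judged independently,
    # each keeping his or her own marginal rating frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented ratings for 10 cases: the raters agree on 8 of 10,
    # but both call most cases "ok", so much agreement is expected by chance.
    a = ["ok"] * 8 + ["poor"] * 2
    b = ["ok"] * 7 + ["poor", "ok", "poor"]
    print(cohens_kappa(a, b))
```

In this invented example the raters agree on 80 percent of cases, yet chance alone would produce 68 percent agreement, so kappa is only 0.375.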

In  a  more  recent  study,  Dubois  et  al.  (1987a,  1987b)  used  both  implicit  and 
explicit  review  to  study  the  quality  of  care  delivered  to  patients  who  died  in  hospital  with 
acute  myocardial  infarction,  pneumonia,  or  stroke.  Explicit  criteria  used  were  taken  in 
part  from  those  developed  by  Kahn  and  colleagues  (Kahn  et  al.,  1988a;  Kosecoff  et  al.,  1988;
Roth  et  al.,  1988;  Rubenstein  et  al.,  1988;  Sherwood  et  al.,  1988).  Implicit  reviews  were  performed  on
dictated  abstracts  of  the  medical  record  prepared  by  one  of  the  investigators.  Three 
reviewers  reviewed  all  cases.  Interrater  reliabilities  (by  ANOVA)  for  labeling  a  death  as 
preventable  were  0.55,  0.51,  and  0.11  for  cerebrovascular  accident,  myocardial 
infarction,  and  pneumonia,  respectively.  To  test  for  stability  in  ratings  over  time  a 
sample  of  45  dictations  was  reviewed  twice;  agreement  was  69  percent  for  the  45  pairs  of 
ratings.  Physicians  also  reviewed  19  photocopied  complete  medical  records  of  patients 
whose  dictated  summaries  they  had  also  reviewed;  their  ratings  for  the  complete  records 
were  compared  with  their  ratings  for  the  19  dictated  summaries,  and  the  rate  of 
agreement  was  84  percent  (16  of  19).  Reliability  for  judging  Overall  Quality  of  care,  as
opposed  to  preventable  deaths,  was  not  reported. 

Although  these  studies  and  others  (Sheps,  1955;  Fletcher,  1965, 1974;  Payne, 
1967;  Morehead  et  al.,  1971;  Dans  et  al.,  1985;  Grimaldi  and  Micheletti,  1985;
Mellette,  1986)  have  introduced  a  strong  note  of  caution  into  the  implicit  review  process, 
no  better  method  of  performing  routine  record  reviews  for  inhospital  care  has  been 
developed.  Explicit  review  is  well  suited  to  large  studies  or  review  efforts  focused  on 
particular  diseases  or  clinical  situations.  Like  implicit  review,  however,  it  requires 
extensive  training  and  monitoring  of  personnel  to  provide  reliable  data.  Explicit  reviews 
become  increasingly  time-consuming  as  they  attempt  to  probe  details  of  clinical 
management.  Furthermore,  although  peer  review  has  intrinsic  validity  as  the  community 
standard,  the  validity  of  explicit  review  may  be  called  into  question  in  judging  the  particular
circumstances  of  an  individual  case. 


2.  METHODS 


PATIENT  SAMPLE 

The  patient  sample  for  the  explicit  review  portion  of  this  study  includes  14,012 
Medicare  patients  65  years  of  age  or  older  from  297  hospitals.  The  297  sampled 
hospitals  represent  97  percent  of  the  305  hospitals  approached.  The  hospitals  were 
randomly  selected  from  30  areas  in  five  states.  States  and  areas  from  which  hospitals 
were  selected  were  chosen  to  be  representative  of  the  U.S.  population  in  terms  of  city 
size  and  geographic  region,  but  hospitals  caring  for  poor  people  were  slightly 
oversampled.  The  study  sampled  patients  from  1981  (20  percent  of  patients),  1982  (30 
percent),  and  1985/1986  (50  percent). 

Patients  hospitalized  with  congestive  heart  failure,  acute  myocardial  infarction, 
pneumonia,  cerebrovascular  accident,  and  hip  fracture  were  randomly  selected  from  lists 
of  all  Medicare  patients  hospitalized  during  the  study  years.  Lists  of  patients  were  based 
on  ICD-9-CM  codes,  but  patient  medical  records  were  then  screened  to  exclude  patients 
who  did  not  have  the  study  disease,  or  who  did  not  have  acute  symptoms  (Kahn  et  al., 
1988b;  Draper  et  al.,  1989). 

All  included  medical  records  underwent  explicit  review.  Ten  percent  of  included 
records  (a  total  of  1,366  records)  were  selected  to  undergo  implicit  review.  Stratified 
random  sampling  was  used,  with  oversampling  of  deaths  so  that  approximately  50 
percent  of  patients  whose  records  underwent  implicit  review  were  patients  who  had  died 
in  the  hospital.  This  sampling  plan  maximized  our  chance  of  identifying  patients  for 
whom  death  occurred  as  a  result  of  poor  quality.  In  the  analyses  reported  here,  data  have 
been  reweighted  to  achieve  representativeness  of  our  findings  for  all  Medicare  patients 
hospitalized  with  one  of  the  five  diseases. 
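
The reweighting step can be illustrated with a minimal sketch, assuming simple inverse-sampling-fraction weights within two strata (died in hospital, survived). The counts below are hypothetical, not study data, and the study's actual weighting scheme may have been more elaborate.

```python
from typing import Dict, List, Tuple


def stratum_weights(population: Dict[str, int], sample: Dict[str, int]) -> Dict[str, float]:
    """Inverse sampling fraction: each sampled record stands for N_h / n_h patients."""
    return {h: population[h] / sample[h] for h in sample}


def weighted_rate(records: List[Tuple[str, bool]], weights: Dict[str, float]) -> float:
    """Weighted share of records flagged True; records are (stratum, flag) pairs."""
    total = sum(weights[h] for h, _ in records)
    flagged = sum(weights[h] for h, flag in records if flag)
    return flagged / total


if __name__ == "__main__":
    # Hypothetical counts: deaths are 10 percent of patients but half the sample.
    population = {"died": 100, "survived": 900}
    sample = {"died": 50, "survived": 50}
    # flag = True means the record was judged to show poor care (invented values).
    records = (
        [("died", True)] * 20 + [("died", False)] * 30
        + [("survived", True)] * 10 + [("survived", False)] * 40
    )
    w = stratum_weights(population, sample)
    unweighted = sum(flag for _, flag in records) / len(records)
    print(unweighted, weighted_rate(records, w))
```

With these invented counts the unweighted rate is 0.30, while the weighted rate is 0.22, because each sampled survivor represents many more patients than each sampled death.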

To  determine  interrater  reliability,  a  randomly  selected  sample  of  about  25  percent 
of  all  records  was  reviewed  by  two  reviewers,  and  3  percent  were  reviewed  by  all  five 
reviewers. 
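
One way such a multiple-review subsample might be drawn is sketched below. The text does not say whether the 3 percent group is nested within the 25 percent group; this sketch assumes the two groups are disjoint, and the function name, seed, and record identifiers are hypothetical.

```python
import random
from typing import Dict, List


def assign_review_counts(record_ids: List[str], seed: int = 0,
                         frac_two: float = 0.25,
                         frac_five: float = 0.03) -> Dict[str, int]:
    """Map each record to the number of independent reviews it receives (1, 2, or 5)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    shuffled = list(record_ids)
    rng.shuffle(shuffled)
    n_five = round(frac_five * len(shuffled))
    n_two = round(frac_two * len(shuffled))
    counts = {rid: 1 for rid in shuffled}
    # Assumption: the five-reviewer and two-reviewer groups do not overlap.
    for rid in shuffled[:n_five]:
        counts[rid] = 5
    for rid in shuffled[n_five:n_five + n_two]:
        counts[rid] = 2
    return counts
```

Shuffling once and slicing keeps the groups disjoint and makes the group sizes exact rather than approximate.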

PERFORMING  IMPLICIT  REVIEW 
Deidentifying  Records

Before  undergoing  implicit  review,  the  inpatient  records  of  all  study  patients  were 
duplicated  and  deidentified.  On  average,  it  took  two  hours  per  record  to  obliterate  all 
names  of  patients,  hospitals,  physicians,  nurses,  cities,  and  states.  Dates  were  retained. 


In  addition,  a  team  of  six  clerical  workers  supervised  by  a  physician  and  public 
health  professional  reviewed  each  medical  record  to  be  sure  the  record  was  legible 
enough  to  be  interpretable  and  complete.  The  check  for  completeness  included 
ascertaining  that  the  physicians'  notes,  order  sheets,  nurses'  notes,  vital  signs,  and 
laboratory  results  were  present  for  each  hospital  day.  Completely  illegible  microfiche 
records  were  excluded.  These  two  tasks  took  an  average  of  55  minutes  per  record. 

Physician  Time  Requirements  for  Performing  Implicit  Review 

Reviews  were  budgeted  at  30  minutes  each;  reviewers  were  instructed  to  keep  to 
this  time  limit,  although  some  reviews  might  require  less  time  and  some  more.  We 
budgeted  training  at  12  hours  per  physician. 

Reviewer  Sample 

Twenty-five  physician  reviewers  participated  in  the  study.  One  reviewer  per 
disease  was  selected  by  each  of  the  five  state  PROs  participating  in  the  study,  but  each 
reviewer  reviewed  records  from  all  states.  We  randomly  assigned  records  to  reviewers. 
No  reviewer  reviewed  patients  from  more  than  one  of  the  five  study  diseases.  All 
reviewers  were  board  certified  in  appropriate  specialties.  Internists  reviewed  records  of 
patients  hospitalized  with  congestive  heart  failure;  cardiologists  reviewed  acute 
myocardial  infarction  records;  pulmonologists  reviewed  pneumonia  records;  neurologists 
reviewed  cerebrovascular  accident  records;  and  orthopedists  reviewed  hip  fracture 
records. 

Reviewer  Training 

Implicit  review,  as  performed  in  this  study,  required  that  reviewers  be  trained  in 
the  use  of  a  structured  review  form  with  accompanying  written  guidelines.  In  training, 
we  specifically  avoided  trying  to  change  reviewers'  opinions  about  what  should  have 
been  done  in  a  given  clinical  situation,  but  we  encouraged  reviewers  to  use  a  uniform  set 
of  rating  terms  as  applied  to  predefined  aspects  of  care. 

Initial  training  of  physician  reviewers  was  carried  out  during  one  three-hour 
small-group  session.  Following  the  session,  each  physician  reviewed  two  training 
records  and  participated  in  a  half-hour  telephone  call  with  study  physicians.  Physicians 
then  reviewed  three  more  records  on  their  own  at  home;  these  records  were  discussed 
during  a  two-hour  conference  call  involving  all  five  physicians  and  two  study 
investigators.  Physician  reviewers  had  previously  been  involved  in  the  PPS  study  in  a
one-day  session  one  and  one-half  years  before  training.  At  that  time,  they  had  reviewed
potential  explicit  criteria  for  each  study  disease. 

During  the  initial  training  session,  our  major  goal  was  to  train  reviewers  to  think 
about  quality  of  care  in  terms  of  the  probability  that  the  care  given  maximized  (or 
worsened)  the  chance  that  the  patient  would  experience  a  good  outcome,  whatever  the 
actual  outcome  was.  The  following  training  statements  were  frequently  repeated: 


1.  Out  of  100  patients  in  this  clinical  state,  how  many  would  have  suffered  a 
worsened  outcome,  compared  to  patients  receiving  standard  to  good  care,  if 
treated  in  this  way?  For  example,  if  100  patients  with  myocardial  infarction 
were  hospitalized  in  an  orthopedic  ward,  as  patient  Smith  was,  how  many 
would  have  experienced  a  significantly  worse  outcome  compared  to  similar 
patients  admitted  to  a  coronary  care  unit?  (We  asked  reviewers  to  make  this 
judgment  whether  or  not  patient  Smith  actually  experienced  a  poor 
outcome.) 

2.  We  are  interested  in  the  overall  care  received,  not  in  who  delivered  it.  For 
example,  if  an  internal  medicine  consultant  or  a  nurse  gave  poor  care  to  an 
orthopedic  patient,  the  patient  received  poor  care,  whatever  the  level  of 
excellence  of  the  surgeon. 

3.  You,  the  reviewers,  are  the  experts.  We  will  not  give  you  "answers"  to  the 
form  or  change  your  opinions  about  appropriate  medical  practice.  Rather, 
we  will  encourage  all  reviewers  to  use  the  same  data  and  to  attach  the  same 
meaning  to  our  rating  terms  when  answering  the  SIR  questions. 


Time 

To  maximize  the  impact  of  our  teaching,  we  used  an  interactive  small  group 
format  to  ensure  the  participation  of  all  reviewers.  During  the  first  half  hour  of  the 
session,  we  began  with  open-ended  questions  eliciting  reviewer  opinions  and  experiences 
regarding  implicit  review.  We  then  gave  a  brief  didactic  introduction  to  quality  of  care, 
including  the  terms  implicit  review,  explicit  review,  structure,  process,  and  outcome.  We 
reviewed  the  tasks  to  be  performed  to  complete  the  study.  For  the  next  15  minutes,  we 
asked  reviewers  to  independently  review  a  single  duplicated  medical  record  and  to  form 
an  impression  regarding  the  quality  of  care  given.  We  then  asked  them  to  review  the 
same  record  using  our  form  (one-half  hour).  For  the  next  hour,  we  asked  each  group
member  in  turn  to  give  his  or  her  rating  for  each  question.  We  discussed  each  question
in  turn,  eliciting  reasons  from  reviewers  for  their  disagreements.  Our  written  guidelines 
often  helped  to  resolve  differences,  because  many  disagreements  were  based  on 
differences  in  interpretation  of  our  rules  and  definitions  rather  than  on  true  differences  of 
opinion.  When  true  differences  of  opinion  existed,  e.g.,  between  a  physician  who 
believed  all  elderly  hypoxic  pneumonia  patients  belonged  in  an  intensive  care  unit  and 
one  who  believed  that  many  such  patients  could  be  cared  for  on  a  ward,  we  did  not 
attempt  to  "correct"  either  opinion.  We  then  broke  for  15  minutes,  reconvened,  and 
repeated  the  process  for  another  medical  record  over  the  course  of  the  next  hour  and  a 
half,  for  a  total  of  four  hours. 

We  conducted  the  telephone  conference  calls  in  a  similar  manner.  The  first  call 
for  each  reviewer  included  two  study  investigators  and  one  physician  reviewer;  the 
second  included  all  physician  reviewers  for  one  of  our  diseases  and  two  study 
investigators.  (The  second  call  thus  involved  only  reviewers  for  a  single  disease;  we 
conducted  a  separate  call  for  each  of  our  five  diseases.)  Each  call  focused  on 
preassigned  and  prereviewed  medical  records.  We  reviewed  answers  to  SIR  questions 
one  at  a  time,  encouraging  dialog  and,  hence,  problem-solving. 

We  based  many  of  our  responses  to  reviewers  on  key  principles  that  provided  the 
conceptual  framework  for  our  guidelines.  These  included  the  three  broad  concepts 
discussed  above,  as  well  as  others  addressing  problems  commonly  encountered  during 
implicit  review. 

Other  Key  Principles  Emphasized  During  Training 

Reviewers  were  taught  during  training  to  "anchor"  their  ratings  to  an  agreed  upon 
definition  of  quality  categories.  For  most  items,  reviewers  were  asked  to  use  the  lowest 
response  category  when  the  care  given  would  have  been  highly  likely  to  contribute  to  a 
bad  outcome,  whether  or  not  a  bad  outcome  actually  occurred  in  that  case.  For  example, 
when  rating  the  quality  of  a  physician's  history  or  clinical  assessment,  raters  were 
instructed  to  use  the  lowest  response  category  ("very  poor")  if  the  rater,  when  asked  to 
see  the  patient  at  midnight,  would  have  to  start  from  scratch  in  the  evaluation.  A 
judgment  of  "adequate"  meant  that  most  essential  historical  and  assessment  observations 
were  included,  but  additional  data  might  be  required  for  optimal  diagnosis  and  treatment. 
"Excellent"  meant  that  all  necessary  data  were  present.  "Good"  was  somewhat  better 
than  adequate,  and  "poor"  somewhat  worse. 


We  asked  reviewers  to  judge  care  according  to  the  individual  needs  of  a  particular 
patient,  rather  than  according  to  preset  standards.  For  example,  one  question  asks 
reviewers  to  rate  the  quality  of  recorded  information  about  the  patient's  acute  disease. 
The  amount  of  information  required  in  the  patient's  recorded  history  might  vary  with  the 
clinical  presentation,  the  patient's  mental  status,  or  the  degree  of  familiarity  of  the 
physician  with  the  patient,  although  for  each  case,  some  amount  of  recorded  information 
was  necessary  to  assure  appropriate  management  and  communication  with  other 
members  of  the  medical  team.  The  amount  of  information  necessary  for  each  individual 
case  was  left  to  the  reviewer's  implicit  judgment. 

Reviewers  were  asked  to  judge  urban,  rural,  teaching,  and  nonteaching  hospitals 
using  the  same  standard.  They  were  asked  not  to  adjust  their  ratings  according  to  their 
guess  about  the  type  of  hospital  from  which  a  record  came.  Reviewers  were  asked  to 
rate  care  as  inadequate  (i.e.,  as  "poor"  or  "very  poor")  when  that  care  did  not  meet  a 
level  achievable  by  most  hospitals  and  not  to  judge  care  as  inadequate  solely  for  failure 
to  perform  extraordinary  or  controversial  procedures. 

Reviewers  were  asked  to  take  account  of  "Do  Not  Resuscitate"  (DNR)  orders. 
They  were  asked  to  judge  care  for  these  patients  according  to  their  own  implicit 
standards for the care of DNR patients. Reviewers were asked not to judge care more
leniently merely because they suspected that the physicians caring for the patient
intended less aggressive treatment, when no explicit statement about less aggressive
care was present in the record.

STRUCTURED  IMPLICIT  REVIEW  FORM  AND  GUIDELINES 

The  SIR  form  (Appendix  1,  Kahn  et  al.,  1990)  included  28  items  and  asked  the 
physician  reviewers  to  rate  specific  aspects  of  the  following:  the  process  of  physician  and 
nursing  care;  the  appropriateness  of  use  of  hospital  services;  patient  prognosis; 
treatability  of  the  patient's  condition;  preventability  of  death  when  it  occurred;  the 
quality  of  the  outcome;  and  an  overall  assessment  of  the  quality  of  care  delivered  during 
the  hospitalization.  Ratings  were  based  on  Likert  scales;  a  five-point  scale  from  very 
poor  to  excellent  was  used  for  most  of  the  items.  We  used  the  same  review  form  for 
patients  with  congestive  heart  failure,  myocardial  infarction,  and  pneumonia.  We 
modified  the  form  slightly  for  hip  fracture  and  for  cerebrovascular  accident. 

Allowable Data Sources Within the Medical Record

To  perform  implicit  reviews,  reviewers  were  instructed  to  fully  review  the  entire 
medical  record,  including  order  sheets,  laboratory  tests,  and  physician  notes,  with  the 
exception  of  nurses'  notes,  which  were  to  be  used  as  needed.  All  nurses'  notes  were 
duplicated  and  reviewers  were  asked  to  comment  on  the  initial  nurse  assessment 
performed  at  the  time  of  admission  to  the  hospital.  An  Overall  Quality  judgment 
integrates  the  multiple  components  of  hospital  care  into  one  Overall  Quality  score.  The 
SIR  form  also  measures  these  components  individually,  assuring  that  all  specified 
components  are  taken  into  account. 

We  should  explain  the  low  emphasis  our  form  places  on  the  specific  measurement 
of  the  quality  of  nursing  care.  Although  we  believe  that  the  quality  of  nursing  care  is  of 
vital  importance  in  determining  patient  outcomes,  we  were  limited  to  30  minutes  on 
average  for  physician  reviews.  This  time  limit  precluded  complete  review  of  nursing 
notes. In addition, we were not convinced that physician reviewers were the appropriate
judges  of  nursing  care.  Physicians  had  all  nurses'  notes  available  for  review,  and  could 
and  did  use  them  as  needed  to  judge  particular  incidents  during  the  hospitalization. 

In  assigning  their  ratings,  reviewers  took  into  account  the  entire  hospitalization, 
including  the  actions  (or  nonactions)  of  nonphysicians,  of  hospital  laboratories  and 
facilities,  and  of  consultants  who  may  have  been  requested  by  the  primary  care 
physician.  For  example,  a  hospitalization  for  a  hip  fracture  during  which  the  surgeon 
performed  well  could  be  downrated  as  a  result  of  the  actions  of  a  poor-quality  internal 
medicine  consultant  or  inappropriate  nursing  care. 

For  each  quality  judgment  the  reviewers  made,  allowable  data  sources  were 
specified  in  the  review  form  or  guidelines.  The  purpose  of  this  was  to  encourage 
physicians  to  assign  ratings  based  on  the  same  data  sources.  For  example,  information 
sources  allowable  for  rating  the  quality  of  information  obtained  about  the  patient's  acute 
illness  include  emergency  department  notes,  consultant  notes,  or  primary  care  notes 
written  within  24  hours  of  the  time  of  admission.  Raters  could  downgrade  records  if 
appropriate  information  was  obtained  but  was  unacceptably  late — e.g.,  when  a  neurology 
consultant  was  the  only  physician  who  indicated  the  level  of  consciousness  in  a  patient 
with  stroke  and  did  not  do  so  until  the  third  day  of  hospitalization. 

Timing of Hospital Services

A  hospitalization  can  be  conceptualized  as  a  series  of  services  delivered  during 
the  course  of  the  hospital  stay  (Table  1).  Conceivably,  the  appropriateness  with  which 
the  service  is  delivered  could  be  good  on  one  day  but  poor  on  another.  We  designed  the 
SIR form with this in mind. The implicit review form asked reviewers to judge some
aspects of care only during the initial 24 hours or so of hospitalization, some during the
initial and subsequent periods separately, and some by averaging the appropriateness of
use of the service over the entire hospital stay. Table 1 illustrates which
questions  on  the  review  form  measured  the  first  24  hours  of  hospitalization  and  which 
measured  the  entire  stay. 

We  think  the  initial  24  hours  of  hospitalization  are  conceptually  different  from 
subsequent  hospital  days.  During  this  initial  period,  key  assessment  data  are  gathered 
and  key  therapeutic  decisions  are  made.  If  the  initial  assessment  is  poorly  performed, 
subsequent  decisions  may  be  flawed.  We  therefore  emphasized  this  initial  period  in 
designing  our  form.  The  period  just  before  discharge  is  another  key  time  period  where 
flawed  decisions  may  be  especially  crucial.  We  specifically  assessed  this  predischarge 
period  with  questions  on  the  appropriateness  of  the  length  of  stay  given  the  patient's 
status  at  discharge  and  the  presence  or  absence  of  instability  at  discharge. 

Services  delivered  during  the  hospitalization  can  be  viewed  as  falling  into  two 
categories.  One  category  consists  of  assessments  and  procedures  that  should  be  done  on 
all  patients  with  the  disease  being  studied.  The  second  consists  of  assessments  and 
treatments  performed  only  for  specific  indications,  such  as  an  abnormal  finding,  an 
adverse  event,  or  a  risk  factor.  We  distinguish  between  these  two  kinds  of  services  only 
for  nurse  assessment;  we  judged  nursing  assessment  as  appropriate  for  all  patients  and 
did  not  attempt  to  measure  nurse  judgment  or  response  to  individual  needs. 

Prognosis 

Two  questions  ask  physicians  to  judge  patient  prognosis  at  admission  and  the 
potential  effectiveness  of  treatment  for  the  patient's  disease.  Reviewers  were  asked  to 
base  their  judgments  on  initial  data  gathered  during  the  first  24  hours  of  admission. 
Reviewers  were  asked  to  ignore  the  actual  process  of  care  during  the  hospitalization  but 
instead  to  assume  that  good  but  not  extraordinary  care  would  be  given.  Reviewers 
judged  the  patient's  outcome  relative  to  what  might  have  been  expected  to  occur  given 
the  severity  of  the  patient's  disease. 

Table 1

The Inpatient Hospitalization: Aspects Evaluated by
Implicit Review by SIR Question Number

                                            Initial Only    Initial and     At or Near
                                            (First 24 hr)   Subsequent      Discharge Only

SERVICES:
Nurse (cognitive)
  Assessment                                Q1e             --
  Therapy                                   --              --
Physician (cognitive)
  Assessment
    Prior and chronic disease               Q1abcd          --
    Acute disease                           Q2abc           --
  Therapy                                   Q2abc           --
Laboratory tests
  "Small ticket"
    (e.g., venous blood tests)              Q2abc           5dh
  "Middle ticket" (e.g., EKG)               Q2abc           5fg
  Invasive or specialized
    (e.g., cardiac cath)                    Q2abc           --
Medications                                 Q2abc           5j
O2 and ventilation                          Q2abc           5c
Special medical personnel
  Physical occupational therapy             Q2abc           5e
  Respiratory therapy                       Q2abc           5b
  Consultant physicians                     Q2abc           5i
Special nursing units
  CCU/ICU                                   Q2abc           5a, ai
  Telemetry                                 Q2abc           5aii

SUMMARY MEASURES:
What was the patient's life expectancy?     Q3              --
How treatable was the disease?              Q4              --
Was the death preventable?
  (patients who died)                       --              Q7
Appropriateness of length of stay,
  and stability at discharge
  (patients discharged alive)               --                              Q6, 6a, 6b
Patient's outcome compared with expected    --              Q8
Overall quality of care                     --              Q9, 10

"--" indicates that this topic is not addressed.

ADDITIONAL DATA SOURCES

In  addition  to  performing  implicit  reviews  on  the  medical  records,  our  analysis 
used  data  from  the  previously  described  explicit  reviews  of  these  records  (Kahn  et  al., 
1990a).  To  evaluate  the  actual  outcomes  of  care  of  the  patients  whose  care  was  assessed 
implicitly,  we  examined  mortality  rates  30  days  after  admission.  We  were  able  to  find 
accurate postadmission mortality data from HCFA's Health Insurance Master file for 92
percent  of  our  study  sample  (Kahn  et  al.,  1990b). 

STATISTICAL  METHODS 

To  facilitate  analysis,  we  developed  implicit  process  scales  and  an  Overall  Quality 
of  care  scale.  We  used  clinical  judgment  to  group  process  items  into  scales.  Process 
scales  were  created  from  the  19  items  listed  in  questions  1-5  on  the  implicit  review  form. 
The  overall  quality  of  care  scale  was  based  on  answers  to  questions  9  and  10.  We  used 
factor  analysis  to  see  whether  our  process  groupings  made  sense  psychometrically;  final 
scales  were  quite  similar  to  our  initial  clinically  developed  scale  groupings.  We  moved 
an  item  from  one  scale  to  another  when  a  high  correlation  with  the  other  factor  indicated 
strongly  that  the  item  belonged  in  that  grouping. 

Reliability 

We  hypothesized  that  some  reviewers  might  judge  care  more  harshly,  and  some 
more  leniently,  than  others.  We  tested  for  the  significance  of  this  reviewer  effect  using  a 
one-way  analysis  of  variance.  We  then  determined  individual  reviewer  standard 
deviations  and  the  extent  to  which  these  were  normally  distributed.  We  also  tested  the 
normality  of  the  scales  directly  within  and  across  reviewers.  To  adjust  for  the  reviewer 
effect,  we  calculated  the  group  of  reviewers'  mean  scores  for  a  question  or  scale  across 
patients,  compared  it  to  the  individual  reviewer's  mean  score  for  that  question,  and 
adjusted  the  individual  reviewer's  scores  for  all  patients  up  or  down  proportionally  to 
match  the  group  mean.  We  then  used  the  adjusted  scores  in  our  analyses.  For  example, 
reviewer  Smith,  M.D.,  gave  pneumonia  patient  John  Doe  a  score  of  four  on  the  five-point 
Likert  scale  for  Item  #1  on  the  implicit  review  form.  The  average  score  for  Dr.  Smith  in 
rating  other  pneumonia  patients  for  Item  #1  was  five.  The  average  score  for  Item  #1  for 
all  reviewers  in  Dr.  Smith's  review  group  was  3.  The  calculated  score  adjusted  for  the 
reviewer  effect  would  be  (4  -  5)  +  3  =  2. 
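
The arithmetic in the Dr. Smith example can be sketched in a few lines of Python. This is an illustration only, not code from the study; the function name `adjust_for_reviewer_effect` is hypothetical, and the sketch uses the additive shift shown in the worked example.

```python
def adjust_for_reviewer_effect(score, reviewer_mean, group_mean):
    """Shift a raw score by the gap between the reviewer's own mean
    and the mean for the whole review group (additive adjustment)."""
    return (score - reviewer_mean) + group_mean


# Values taken from the Dr. Smith example in the text.
adjusted = adjust_for_reviewer_effect(score=4, reviewer_mean=5, group_mean=3)
print(adjusted)
```

A reviewer who rates higher than the group on average has every score moved down by the same amount, which is what the example shows.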

Interrater Reliability

To  determine  interrater  reliability,  we  first  determined  the  percentage  agreement 
between  reviewers  for  scales  and  items  on  the  implicit  review  form.  Percentage 
agreement,  however,  can  be  highly  misleading,  since  it  does  not  take  into  account  the 
amount  of  agreement  that  would  be  expected  on  the  basis  of  chance  alone,  given  the 
underlying  distribution  of  scores  for  the  item  or  scale  in  question.  A  70  percent  level  of 
agreement  could  be  very  good  or  totally  unacceptable.  If  reviewers  agree  70  percent  of 
the  time  on  the  year  of  birth  across  the  500  cases  they  review,  but  99  percent  of  the  cases 
were  born  in  1956,  70  percent  agreement  is  poor.  If  the  agreement  is  70  percent  when 
the  true  answers  across  the  500  cases  rated  are  well  distributed  (e.g.,  spanning  a  large 
number  of  possible  birth  years),  70  percent  may  be  very  good. 

A  frequently  used  method  for  adjusting  the  percentage  agreement  to  remove 
agreement  expected  by  chance  alone  is  to  calculate  the  Kappa  statistic.  Kappa,  however, 
is  most  appropriate  when  the  ratings  are  unordered  categorical  variables,  such  as 
diagnostic  categories  or  yes/no  items. 
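
The difference between raw percentage agreement and chance-corrected agreement can be illustrated with a small Python sketch of Cohen's Kappa for two raters. The ratings below are invented for illustration, and the function names are hypothetical; the study itself did not use Kappa for its scales.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Fraction of cases on which the two raters gave the same answer."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohen_kappa(ratings_a, ratings_b):
    """Agreement beyond chance, scaled by the maximum possible
    agreement beyond chance (undefined if chance agreement is 1)."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    # Chance agreement: probability both raters pick the same category
    # if each rated independently with their own marginal frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented yes/no ratings (1 = yes, 0 = no) for eight cases.
rater_a = [1, 1, 1, 1, 0, 0, 0, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 1]
print(percent_agreement(rater_a, rater_b), cohen_kappa(rater_a, rater_b))
```

In this invented data the raters agree on 75 percent of cases, but half of that agreement would be expected by chance, so Kappa is lower than the raw figure.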

In  this  study,  we  have  used  analysis  of  variance  rather  than  Kappa  to  assess 
interrater  reliability  because  we  use  ordered  scales.  For  example,  a  five  on  one  of  our 
scales  is  not  very  different  from  a  six,  but  very  different  from  a  nine;  in  assessing 
interrater  reliability,  analysis  of  variance  takes  these  relative  disagreements  into  account 
appropriately.  Before  calculating  interrater  reliability,  we  adjusted  scores  for  reviewer 
effects  as  described  above. 
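
As an illustration of why an analysis-of-variance approach suits ordered scales, the sketch below computes a one-way intraclass correlation from between-case and within-case mean squares. The text does not state which ANOVA-based coefficient was used, so the choice of this particular coefficient, the function name `icc_oneway`, and the data are all assumptions made for illustration.

```python
def icc_oneway(ratings):
    """One-way intraclass correlation for a balanced design.

    ratings: list of cases, each a list of k scores given by
    different reviewers to the same case.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(case) for case in ratings) / (n * k)
    case_means = [sum(case) / k for case in ratings]
    ms_between = k * sum((m - grand_mean) ** 2 for m in case_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for case, m in zip(ratings, case_means) for x in case
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Invented scores: in both sets the two reviewers never agree exactly,
# so percentage agreement is zero in each.
near_misses = [[5, 6], [2, 3], [8, 9]]   # reviewers differ by one point
far_misses = [[5, 9], [2, 6], [8, 4]]    # reviewers differ by four points
print(icc_oneway(near_misses), icc_oneway(far_misses))
```

Exact agreement cannot tell these two sets apart, whereas the variance-based coefficient is high for the near misses and low for the far misses.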

Interitem Reliability

To  determine  whether  the  process  and  Overall  Quality  scales  we  formed 
performed  well  psychometrically,  we  determined  interitem  reliability  (Cronbach's 
alpha). 
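
For readers unfamiliar with Cronbach's alpha, the following minimal Python sketch shows the standard calculation for a scale of k items. The data and function names are invented for illustration and do not come from the study.

```python
def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(records):
    """records: list of reviewed cases, each a list of k item scores."""
    k = len(records[0])
    item_variances = [
        sample_variance([rec[i] for rec in records]) for i in range(k)
    ]
    total_variance = sample_variance([sum(rec) for rec in records])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Invented two-item scale scored on four cases.
scores = [[1, 2], [2, 2], [4, 5], [5, 4]]
print(cronbach_alpha(scores))
```

Alpha approaches 1 when the items in a scale rise and fall together across cases, and falls toward 0 when they vary independently.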

Overall  Quality  Scale 

The  Overall  Quality  of  Care  scale  consisted  of  two  items:  "Considering 
everything  you  know  about  the  patient,  please  rate  overall  quality  of  care"  with  five 
response  categories  ranging  from  extreme,  above  standard,  to  extreme,  below  standard; 
and  "Would  you  send  your  mother  to  these  physicians  in  this  hospital?"  with  four 
response  categories  ranging  from  definitely  yes  to  definitely  no.  Responses  to  these 
questions  were  added  together  to  form  an  eight-point  Overall  Quality  of  Care  scale  from 
2 (both questions answered with the highest possible rating of 1) to 9 (worst care). Cases
reviewed  by  multiple  reviewers  were  assigned  the  mean  adjusted  score  across  reviewers. 
We  then  divided  the  Overall  Quality  of  Care  scale  into  four  levels  of  care:  (1)  "very 
poor  care,"  a  score  of  7.5  to  9;  (2)  "poor  care,"  a  score  of  6.5  to  7.4;  (3)  "good  care,"  a 
score  of  3.5  to  6.4;  and  (4)  "very  good  care,"  a  score  of  2  to  3.4.  Cutoffs  were  chosen  to 
best  reflect  the  meaning  of  the  response  categories.  One  question  (#9)  had  three 
responses  that  according  to  our  training  and  guidelines  were  in  the  "acceptable"  care 
range, and two in the "unacceptable" range. One question (#10) had two in the
"acceptable"  and  two  in  the  "unacceptable"  range.  For  example,  poor  care  or  very  poor 
care  means  that  responses  to  both  of  the  Overall  Quality  questions  were  rated  in  the 
"below  standard"  or  "probably  would  not  send  my  mother"  range. 

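The construction of the Overall Quality of Care scale and its four levels can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the function names are hypothetical, and because the published cutoffs leave small gaps (for example, between 7.4 and 7.5), the sketch assumes each level starts at its lower bound.

```python
def overall_quality_score(q9, q10):
    """Sum of question 9 (1 to 5) and question 10 (1 to 4);
    2 is the best possible score and 9 the worst."""
    if not (1 <= q9 <= 5 and 1 <= q10 <= 4):
        raise ValueError("q9 must be 1-5 and q10 must be 1-4")
    return q9 + q10


def quality_level(score):
    """Map a (possibly averaged) score to one of the four levels.
    Assumption: a level begins at its stated lower cutoff."""
    if score >= 7.5:
        return "very poor care"
    if score >= 6.5:
        return "poor care"
    if score >= 3.5:
        return "good care"
    return "very good care"


print(quality_level(overall_quality_score(1, 1)))
print(quality_level(overall_quality_score(5, 4)))
```
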
Components  of  Variance 

We  used  components  of  variance  analysis  to  understand  reasons  for  disagreement 
among  multiple  reviewers  of  the  same  case,  and  to  examine  the  relationship  between  this 
disagreement  and  the  conclusions  that  could  be  drawn  about  the  true  quality  differences 
between  patients  assigned  particular  quality  ratings.  The  variance  analysis  of  the 
distribution  of  cases  between  very  good,  good,  poor,  and  very  poor  care  included  (1)  the 
reviewer  effect  discussed  above,  (2)  differences  of  opinion  among  reviewers,  and  (3) 
actual  differences  in  the  medical  care  delivered  to  the  patient.  We  are  clearly  most 
interested  in  the  last  component  of  the  variance.  We  used  the  variance-components 
model  to  adjust  results  of  the  Overall  Quality  scale  for  the  reviewer  effects  and  for 
differences  of  opinion  among  reviewers.  These  components  are  mathematically  related 
to  imperfect  reliability.  The  figures  generated  by  this  method  provide  a  conservative 
estimate  of  the  amount  of  care  judged  to  be  inadequate. 

In  calculating  the  "true"  population  quality  scores,  by  adjusting  for  interrater 
reliability,  we  accounted  for  the  less  than  perfect  interrater  reliability  of  each  scale.  The 
calculation  was  done  in  two  steps. 

Step  One:  We  postulated  that  the  underlying  distribution  of  true  quality  scores 
had  a  normal  distribution.  The  result  of  any  particular  review  was  grouped  into  one  of 
eight  possible  scores  for  this  scale  (2  to  9).  The  result  was  regarded  as  a  random  draw 
representing  the  population  mean  plus  a  random  reviewer/reviewer  opinion  effect. 

Step  Two:  We  estimated  the  mean  of  underlying  true  quality  scores  using  the 
average  of  the  observed  scores  and  the  variance  of  the  true  scores.  We  used  the 
variance-components model to adjust for the reviewer effect and the differences of
opinion  between  reviewers.  These  components  are  mathematically  related  to  imperfect 
reliability.  These  concepts  can  be  represented  in  the  following  equation: 

R = alpha + beta + gamma

where alpha is the reviewer effect, beta is the "true" care differences, and gamma is
differences of opinion among reviewers.

Alpha,  beta,  and  gamma  each  have  an  associated  component  of  variance  and  add  up  to  R, 
the  observed  variance. 
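
The decomposition can be illustrated by converting three variance components into percentage shares of the observed variance. The component values below are invented, and the function name `variance_shares` is hypothetical.

```python
def variance_shares(reviewer_effect, true_care, opinion):
    """Return each component as a percentage of the observed variance R,
    where R is the sum of the three components."""
    observed = reviewer_effect + true_care + opinion
    return {
        "reviewer effect": 100 * reviewer_effect / observed,
        "true care differences": 100 * true_care / observed,
        "opinion": 100 * opinion / observed,
    }


# Invented component values, chosen only to make the arithmetic easy.
shares = variance_shares(reviewer_effect=1, true_care=9, opinion=10)
print(shares)
```

Because the three components add up to the observed variance, the three percentages add up to 100.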

CALCULATING  THE  NUMBER  OF  REVIEWERS  NEEDED 

We  calculated  the  number  of  reviewers  needed  to  achieve  given  levels  of  certainty 
about the results of a review of an individual case. We did this using the
Spearman-Brown prophecy formula. The minimum value of "n," the number of reviewers needed
to  achieve  a  specified  level  of  certainty  that  the  score  was  correct,  depends  on  the 
distribution  of  true  quality  scores  as  well  as  the  definition  of  "poor"  and  the  reliability  of 
the  instrument. 
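
The reliability part of this calculation can be sketched with the standard Spearman-Brown relationship, in which the reliability of the mean of n reviews is n*r / (1 + (n - 1)*r) for single-review reliability r. The sketch below is illustrative only; it covers the reliability term alone and not the dependence on the distribution of true scores or the definition of "poor," and the function names are hypothetical.

```python
import math


def reliability_of_mean(single_reliability, n_reviewers):
    """Spearman-Brown: reliability of the average of n reviews."""
    r = single_reliability
    return n_reviewers * r / (1 + (n_reviewers - 1) * r)


def reviewers_needed(single_reliability, target_reliability):
    """Smallest whole number of reviewers whose averaged score
    reaches the target reliability."""
    r, t = single_reliability, target_reliability
    n = t * (1 - r) / (r * (1 - t))
    # Small tolerance so floating-point noise does not add a reviewer.
    return math.ceil(n - 1e-9)


# Illustrative values: single-review reliability 0.5, target 0.8.
print(reviewers_needed(0.5, 0.8), reliability_of_mean(0.5, 4))
```

With these illustrative values, averaging four reviews is enough to move from a reliability of 0.5 to 0.8.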

Using  this  method  of  estimating  the  true  rating  of  quality  of  care  for  the 
population  under  study,  we  called  the  medical  care  delivered  "poor"  if  the  true  rating 
exceeded  6.5  on  our  scale.  Thus,  although  we  did  not  observe  this  true  score,  we 
estimated  the  probability  that  the  score  was  truly  poor  on  the  basis  of  the  actual  reviews 
we  performed  and  the  properties  of  the  normal  approximation. 
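
One way to picture this estimate is with a simple normal model in which true scores are normally distributed around the population mean and each review adds independent normal error. The text does not give the formulas used, so everything in this sketch is an assumption: the model, the parameter names, and the numbers.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def prob_truly_poor(observed_mean, n_reviews, pop_mean, true_var,
                    error_var, cutoff=6.5):
    """Probability that the true score exceeds the cutoff, given the
    mean of n reviews, under an assumed normal-normal model."""
    noise_var = error_var / n_reviews
    shrink = true_var / (true_var + noise_var)
    # The estimate of the true score is pulled toward the population mean.
    post_mean = pop_mean + shrink * (observed_mean - pop_mean)
    post_var = shrink * noise_var
    return 1 - normal_cdf((cutoff - post_mean) / math.sqrt(post_var))


# Invented inputs: same observed mean, one review versus four reviews.
p_one = prob_truly_poor(8, 1, pop_mean=5, true_var=1, error_var=1)
p_four = prob_truly_poor(8, 4, pop_mean=5, true_var=1, error_var=1)
print(p_one, p_four)
```

Under these invented inputs, the same observed mean gives much greater certainty that care was truly poor when it rests on four reviews rather than one.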

We  used  linear  regression  to  evaluate  whether  poor  quality  of  care  measured 
implicitly  by  the  Overall  Quality  scale  predicted  death  within  30  days,  after  adjusting  for 
patient  sickness.  (Logistic  regression  produced  similar  results.)  To  adjust  for  patient 
sickness  at  admission,  we  used  both  implicitly  and  explicitly  measured  sickness.  The 
implicit  sickness  at  admission  measures  are  life  expectancy  (Question  #3)  and  treatability 
of  the  patient's  disease  (Question  #4)  (see  Appendix  D).  We  describe  our  explicit 
measure  elsewhere  (Keeler  et  al.,  1990). 

We  also  used  linear  regression  to  determine  whether  reviewers  assigned  quality 
scores  differently  when  patients  died  in,  as  compared  with  outside,  the  hospital.  To 
account  for  the  difference  between  earlier  and  later  deaths,  we  controlled  for  the  number 
of days between hospital admission and day of death in our analyses (proportional
hazards  model). 

3. RESULTS PART I: PERFORMANCE OF THE SIR METHOD


PATIENT  SAMPLE 

A  total  of  275  records  for  acute  myocardial  infarction,  278  records  for  congestive 
heart  failure,  273  records  for  pneumonia,  270  records  for  cerebrovascular  accident,  and 
270  records  for  hip  fracture  were  reviewed,  for  a  total  of  1,366  records.  This  was  93 
percent  of  records  selected  for  implicit  review.  The  remaining  7  percent  were  not 
included  because  of  late  return  from  the  hospital  or  reviewer  (2  percent),  because  the 
microfiche  copy  was  unreadable  (3.5  percent),  and  for  miscellaneous  reasons  (1.5 
percent).  Of  the  1,366  records  that  were  reviewed,  993  were  reviewed  once,  333  were 
reviewed  twice,  33  were  reviewed  five  times,  and  seven  (hip)  records  were  reviewed 
four  times,  for  a  total  of  1,852  reviews  performed.  To  evaluate  our  randomization,  we 
compared  the  implicit  review  sample  to  the  larger  explicit  review  sample  from  which  it 
was  drawn.  The  explicit  sample  is  known  to  closely  match  national  data  for  patient 
demographic  and  hospital  characteristics.  Study  results  show  that  the  implicit  and 
explicit  samples  were  similar  in  composition  in  terms  of  age,  sex,  race,  and  type  of 
hospital  (Table  2). 
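
The record and review counts reported above can be cross-checked with simple arithmetic. The short Python sketch below uses only the figures given in the text; the variable names are ours.

```python
# Records reviewed, by disease, as reported in the text.
records_by_disease = {
    "acute myocardial infarction": 275,
    "congestive heart failure": 278,
    "pneumonia": 273,
    "cerebrovascular accident": 270,
    "hip fracture": 270,
}

# Number of records keyed by how many times each was reviewed.
records_by_review_count = {1: 993, 2: 333, 5: 33, 4: 7}

total_records = sum(records_by_disease.values())
total_reviews = sum(
    times * count for times, count in records_by_review_count.items()
)
print(total_records, total_reviews)
```

Both breakdowns give 1,366 records, and the review counts multiply out to the 1,852 reviews reported.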


Table 2

Results: Patient Sample

                                            Implicit      Explicit
                                            n=1,366       n=14,012
                                            (%)           (%)

Demographic Characteristics of Patients
  Age ≥ 80                                  43            41
  Female                                    54            57
  Non-white                                 16            19

Characteristics of Hospitals
  Rural                                     20            21
  Any teaching                              36            35
  County                                    13            14
  Serves high percent Medicaid patients     16            19

IMPLICIT REVIEW QUALITY SCALES

Our  final  scales  and  their  component  items  are  listed  in  Appendix  D. 

REVIEWER  EFFECT 

We  found  a  significant  reviewer  effect  for  most  diseases  (p  <  0.01  by  ANOVA  for 
all  diseases  except  cerebrovascular  accident,  for  which  there  were  no  significant 
reviewer  effects).  That  is,  some  reviewers  gave  significantly  lower  quality  scores  across 
all  cases  they  reviewed  than  did  others.  The  reviewer  effect  accounted  for  about  4 
percent  of  the  variance  in  scores  (1  percent  for  cerebrovascular  accident,  8  percent  for 
hip  fracture,  4  percent  for  congestive  heart  failure,  6  percent  for  acute  myocardial 
infarction,  and  1  percent  for  pneumonia).  Subsequent  analyses  reported  here  have  been 
adjusted  for  the  reviewer  effect. 

However,  this  is  not  the  full  story.  There  are  significant  though  small  differences 
between  the  standard  deviations  of  the  reviewers'  mean  quality  scores.  So  it  appears  that 
some  reviewers  were  more  extreme  in  their  approach  than  others.  Although  it  is  possible 
that  the  adjustment  process  could  be  refined  to  take  account  of  this,  it  would  not  have 
much  impact  on  the  results.  There  was  no  pattern,  for  example,  in  which  reviewers  with 
lower  mean  values  had  lower  or  higher  standard  deviations. 

OVERALL  QUALITY  SCALE 

Our  results  indicated  that  the  Overall  Quality  scale  accurately  summarized  the 
results  of  the  remainder  of  the  implicit  review  form.  The  two  items  in  the  Overall 
Quality  scale  were  highly  correlated  with  each  other  and  with  responses  to  the  more 
detailed  quality  questions  in  other  sections  of  the  form.  The  Pearson  correlation 
coefficient  between  the  two  questions  that  constitute  the  Overall  Quality  scale  was  0.86. 
Correlations  between  ratings  on  the  Overall  Quality  scale,  the  process  scales,  and  the 
Preventable  Death  item  were  also  high  (Table  3).  For  example,  the  high  negative 
correlation  of  the  Overall  Quality  scale  with  Preventable  Death  indicates  that  physicians 
strongly  downrated  overall  quality  of  care  for  records  when  they  believed  that  a 
preventable  death  had  occurred.  Overall,  responses  to  the  more  detailed  items  and  scales 
on  the  review  form  accounted  for  about  56  percent  of  the  variance  (R2)  in  the  Overall 
Quality  scale.  This  reflected  an  R2  of  83  percent  for  congestive  heart  failure,  acute 
myocardial  infarction,  and  pneumonia,  an  R2  of  51  percent  for  hip  fracture,  and  an  R2  of 
73  percent  for  cerebrovascular  accident. 

Table 3

Correlations of Individual Implicit Review Scales
and Items with Overall Quality of Care Scale

                                                      Correlation Coefficient
Type of Scale    Individual Implicit Review           CHF, AMI,
or Item          Scales and Items                     PNE         HIP      CVA

Physician        MD functional and chronic
                   disease assessment                  0.59        0.51     0.66
                 MD acute disease assessment
                   and plan                            0.82        0.59     0.82
                 Laboratory evaluation performed       0.61        0.63     0.53
                 Treatment performed                   0.68        0.72     0.71
Nurse            RN admission assessment               0.24        0.19
Outcomes         Quality of the outcome                0.34        0.42     0.40
                 Length of stay                        0.56        0.43     0.48
                 Preventable death                    -0.66       -0.57    -0.64
Sickness at      Patient prognosis at admission        0.02       -0.01
admission

There were two exceptions to the high correlations between the Overall Quality
scale  and  other  scales.  As  we  had  anticipated,  given  the  minimal  review  of  nurses'  notes 
included  in  our  review  protocol,  the  correlation  between  the  Overall  Quality  scale  and 
nurse  assessment  was  quite  low.  The  correlation  with  patient  prognosis  was  also  low,  as 
we  had  hoped,  indicating  that  physicians  were  not  systematically  rating  quality  of  care  on 
the  basis  of  patient  sickness  at  admission. 

RELIABILITY 

To  evaluate  the  psychometric  characteristics  of  SIR,  we  tested  interitem  and 
interrater  reliabilities,  after  adjusting  for  reviewer  effect. 

Table  4  lists  the  implicit  review  scales  and  their  interitem  and  interrater 
reliabilities.  Appendix  E  lists  the  reliabilities  for  implicit  review  items.  We  have  also 
included  interrater  reliabilities  for  some  of  the  items  not  included  in  the  scales. 
Calculations  were  not  applicable  when  the  relevant  items  were  not  included  on  the  form 
for  a  particular  disease,  or,  for  interitem  reliability,  when  there  was  only  one  item 
addressing  the  indicated  topic.  For  scales,  alpha  or  interitem  reliabilities  were  0.8  to  0.9. 

Table 4

Reliability of Implicit Review for Five Diseases

Columns, in order: Groups of Items on SIR Form; Number of Items on SIR Form
(CHF, AMI, PNE; HIP; CVA); Cronbach's Interitem Reliability (CHF, AMI, PNE; HIP; CVA);
Interrater Reliability (CHF, AMI, PNE; HIP; CVA)

MD  function  and  chronic 

disease  assessment 

A 

A 

z 

ft  so 
u.oy 

n  so 
u.oy 

ft  90 
U.oZ 

ft  AO 

u.oy 

ft  71 
U./l 

ft  AO 

u.oy 

MD  acute  disease  assessment 

and  plan 

•1 
J 

z 

A 

ft  Q"X 

ft  A7 
u.o  / 

ft  QO 

u.yz 

ft  ^0 

ft  7S 

u.zj 

ft  AO 

u.oy 

Laboratory  evaluation 

A 

A 

■1 
J 

ft  90 
U.OZ 

ft 

U.  /J 

ft  70 

u.  /y 

ft  AO 

ft  00 
u.zz 

ft  OA 

u.  m 

Treatments 

O 

A 

•I 
J 

ft  fiA 
U.OO 

ft  AO 

u.oy 

A  81 
U.ol 

ft  Aft 
U.hU 

ft  io 

ft  A^ 
U.Oj 

Respiratory  care 

3 

3 

0.84 

0.91 

0.44 

0.42 

Quality  of  surgery 

3 

0.95 

0.90 

CVA  therapy 

2 

0.76 

0.47 

Special  neurologic  testing 

3 

0.45 

0.45 

RN  assessment 

1 

1 

0.34

0.50 

Appropriateness  of  length 

of  stay 

1 

0.13 

0.58 

0.42 

Preventable  death 

1 

0.16 

0.35

0.50 

Life  expectancy 

0.64 

0.73 

Treatability 

0.57 

0.57 

Quality  of  the  outcome 

1 

0.48 

0.74 

0.69 

Function  at  discharge 

0.80 

Appropriateness  of  surgery 

1 

0.30

Overall  quality  of  care 

2 

2 

2 

0.93 

0.87 

0.94 

0^2 

0.44 

0.66 

NOTE:  Dashes  indicate  that  the  notation  is  not  applicable. 


Interrater  reliabilities  for  scales  were  mostly  between  0.4  and  0.7,  a  range  considered 
adequate  for  comparing  groups  of  people,  but  did  not  reach  the  0.8  to  0.9  level 
considered  desirable  for  rating  a  single  individual. 


COMPONENTS  OF  VARIANCE  IN  QUALITY  SCORES 

To  estimate  how  much  of  the  difference  in  quality  of  care  ratings  was  due  to 
actual  differences  in  the  care  the  patient  received  compared  with  methodologic  artifact, 
we  evaluated  the  percentage  of  the  variance  in  Overall  Quality  of  care  found  in  our 
sample  as  a  whole  that  was  accounted  for  by  each  of  our  three  postulated  sources  of 
variance.  These  were  the  reviewer  effect,  the  effect  of  imperfect  interrater  reliability, 
and  the  effect  of  true  differences  in  the  quality  of  care  provided  to  different  patients. 
Overall, the reviewer effect accounted for about 0.9 to 7.6 percent of the variance in
Overall  Quality  scores  between  records;  differences  of  opinion  regarding  case-specific 
management  between  reviewers  accounted  for  38  to  65  percent,  and  differences  in 
characteristics  of  the  care  the  patient  received  accounted  for  36  to  63  percent  (Table  5). 
Thus,  the  range  of  quality  of  care  scores  in  our  sample  reflected,  to  a  substantial  extent, 
true  differences  in  the  care  delivered  to  patients. 

LINKING  PROCESS  TO  OUTCOME 

To  further  validate  our  SIR,  we  determined  whether  the  process  of  care  for  a 
patient  with  a  given  sickness  at  admission  could  be  linked  to  the  patient's  mortality.  We 
found (Table 6) that death within 30 days of hospital admission was strongly correlated,
at each level of sickness, with the quality of care as rated by implicit review.

Better  Overall  Quality  of  care  ratings  were  also  linked  with  lower  mortality  rates 
at  180  days  after  admission  and  for  posthospital  deaths.  However,  in  neither  of  these 
instances were these data significant at the 0.05 level. The lack of significance is at least
in part due to the small number of posthospital deaths: half of our sample was selected to
represent inhospital deaths, so we had many fewer posthospital deaths to evaluate.

To  guard  against  the  possibility  that  our  reviewers  were  biased  in  their 
assessments  by  knowledge  that  the  patient  had  died,  we  compared  quality  ratings  for 
patients  who  died  in  and  outside  the  hospital  on  each  postadmission  day,  using  the 
proportional  hazards  model  analysis  strategy  described  in  Sec.  2.  We  found  that  the 

Table 5

Percentage of Variance in Reviewer Judgments
Accounted for by Three Sources of Variance

                        Differences of
          Reviewer      Opinion Among      Differences in
          Effect        Reviewers          Care Provided

CVA          1              38                  63
HIP          8              65                  32
CHF          4              59                  37
AMI          6              49                  45
PNE          1              39                  59

Table 6

Predicting Inhospital Death, 30-Day Death, and 180-Day Death
Using Quality of Care Scores from Implicit Reviews
For Five Diseases Combined

                                              Death within       180-Day
                           Inhospital         30 Days of         Posthospital
                           Death^a            Admission^b        Death^c
                           Coef.     T        Coef.     T        Coef.     T

Overall quality of care     0.19     4.3*      0.19     4.6*      0.08     1.7+
RAND sickness at admission
  by explicit review^d      1.10    15.6*      1.00    15.7*      0.59     7.8*
Severity of illness
  by implicit review        0.16     3.8*      0.16     3.6*      0.09     1.8+
Treatability
  by implicit review        0.81     6.1*      0.76     5.6*      0.50     3.2**
Life expectancy            -1.00    -9.8*     -1.00    -9.7*     -0.51    -4.0*

*  p < 0.0001.
** p < 0.001.
+  p < 0.10.

a  The dependent variable includes only those patients who died in the hospital.

b  The dependent variable includes only those patients who died within 30
days of admission regardless of whether the patient died in the hospital or after
discharge.

c  The dependent variable includes only those patients who died after
discharge from the study hospitalization and within 180 days of admission for
the study hospitalization.

d  See Keeler et al. (1990).


relationship  between  quality  of  care  and  death  was  as  strong  for  the  group  who  died 
outside  the  hospital  as  for  those  dying  in  the  hospital  when  matched  for  postadmission 
day.  That  is,  when  we  compared  patients  who  died  on  a  particular  day  inside  and  outside 
the  hospital  to  see  if  we  observed  a  difference  in  quality,  the  average  pairwise  difference 
was  not  significant.  Controlling  for  hospital  day  number,  quality  of  care  scores  were  the 
same  for  patients  who  died  in  or  outside  the  hospital,  using  multiple  regression.  We 
found  that  quality  ratings  for  people  who  had  died  were  only  mildly  worse,  overall,  than 
those  for  people  who  survived. 

Tables  7  and  8  summarize  the  relationships  between  quality  of  care  and  mortality 
in  terms  of  the  percentage  of  people  who  died  within  30  days  of  admission  and  those 
discharged  alive  who  died  within  180  days  of  admission.  Patients  receiving  very  poor 
quality  care,  after  adjustment  for  sickness,  were  more  likely  to  die  in  all  analyses. 
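
As an illustration of how the relative risk in the "All" column of Table 7 can be recovered from the printed figures, the following sketch pools the three other quality groups and divides the very poor group's death rate by the pooled rate. It is not the original analysis: it uses the rounded percentages shown in the table, and the function name `relative_risk` is a hypothetical helper introduced here.

```python
# Rows of the "All" column of Table 7: (quality group, n, percent dead within 30 days).
# The percentages are the rounded values printed in the table, so the result is approximate.
rows = [
    ("very good", 189, 17),
    ("good", 675, 14),
    ("poor", 164, 13),
    ("very poor", 169, 30),
]


def relative_risk(rows, exposed_label):
    """Death rate in the exposed group divided by the pooled rate in all other groups."""
    exposed = [r for r in rows if r[0] == exposed_label]
    others = [r for r in rows if r[0] != exposed_label]
    # Weight each group's percentage by its sample size to pool the rates.
    exposed_rate = sum(n * pct for _, n, pct in exposed) / sum(n for _, n, _ in exposed)
    others_rate = sum(n * pct for _, n, pct in others) / sum(n for _, n, _ in others)
    return exposed_rate / others_rate


rr_very_poor = relative_risk(rows, "very poor")
print(round(rr_very_poor, 2))  # 2.08
```

The pooled 30-day death rate for all other patients is about 14.4 percent, so the ratio of 30 percent to 14.4 percent agrees with the relative risk of 2.08 reported in the table.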
Table 7

Percentage of Patients Who Died Within 30 Days of Hospitalization
Depending Upon the Overall Quality of Care Received
and the Level of Sickness at Admission^a

                               Least Sick                        Most Sick
Quality of Care                I          II         III         IV          All
by Implicit Review             (n=294)    (n=301)    (n=300)     (n=302)     (n=1,197)

Very good (n = 189)             1          5         13          46          17
Good (n = 675)                  3          6         13          34          14
Poor (n = 164)                  2          4          8          49          13
Very poor (n = 169)^b           6*        17**       34**        55**        30**
Total patients (n = 1,197)      3          7         15          41          17

Relative risk of
30-day death:
Very poor vs.
all others^b,c                  2.43*      3.09**     2.79**      1.42**      2.08**

a  Sickness at admission is measured explicitly (Keeler et al., 1990). We constructed
severity quartiles for each of the five diseases by dividing patients into four equal groups
ranging from the least sick 25 percent to the most sick 25 percent.

b  The 30-day death rate for patients with very poor care differs from the corresponding
rate for all other patients by an amount whose significance is indicated by asterisks:
* = p < 0.05, ** = p < 0.01 (i.e., relative risks are significantly different from 1).

c  We found no evidence that reviewers systematically downrated quality for patients who
died in the hospital. Stratifying by day of death, inhospital and out-of-hospital deaths had
equivalent quality scores. The sample size for out-of-hospital deaths was small, but the trend
was the same as for inhospital deaths; i.e., patients with poor quality scores had an increased
risk of death (p = 0.07).


Table 8

Percentage of Patients Discharged Alive Who Died Outside
of the Hospital within 180 Days by Overall Quality of
Care Received and the Level of Sickness at Admission

                                Least Sick     Most Sick
Quality of Care by              (I or II)      (III or IV)     All
Implicit Review                 n = 572        n = 455         n = 1,027

Very good or good (n = 757)         9             25              16
Poor or very poor (n = 270)        12             32              21
Total patients (n = 1,027)

COMPARISON BETWEEN IMPLICIT AND EXPLICIT REVIEWS

To  compare  implicit  review  and  explicit  review  evaluations  of  quality  of  care,  we 
used  the  four  levels  of  care  measured  by  our  Overall  Quality  of  Care  scale.  As  discussed 
in  Sec.  2,  these  levels  were:  (1)  "very  poor  care,"  a  score  of  7.5  to  9;  (2)  "poor  care,"  a 
score  of  6.5  to  7.4;  (3)  "good  care,"  a  score  of  3.5  to  6.4;  and  (4)  "very  good  care,"  a 
score  of  2  to  3.4.  Explicit  scales  are  scored  from  0  to  100  based  on  the  percentage  of 
applicable  explicit  criteria  adhered  to  in  the  patient's  medical  record.  A  higher  score 
indicates  better  process  by  explicit  criteria  (Kahn  et  al.,  1990b). 
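
The two scoring conventions just described can be sketched as follows. This is a minimal illustration, not the study's scoring program: the function names are hypothetical, and the handling of scores that fall between the printed ranges (for example, 7.45) is an assumption, since the text gives the cut points only to one decimal place.

```python
def classify_overall_quality(score: float) -> str:
    """Map an implicit Overall Quality of Care score (2 = best, 9 = worst) to a care level.

    Assumption: scores between the printed ranges are assigned to the better
    (lower-numbered) category, i.e., each category begins at its lower cut point.
    """
    if not 2.0 <= score <= 9.0:
        raise ValueError("Overall Quality of Care scores range from 2 to 9")
    if score >= 7.5:
        return "very poor care"
    if score >= 6.5:
        return "poor care"
    if score >= 3.5:
        return "good care"
    return "very good care"


def explicit_scale_score(criteria_met: int, criteria_applicable: int) -> float:
    """Explicit scale score: percentage of applicable criteria adhered to (0 to 100)."""
    if criteria_applicable <= 0:
        raise ValueError("at least one criterion must be applicable")
    return 100.0 * criteria_met / criteria_applicable


print(classify_overall_quality(7.8))   # very poor care
print(explicit_scale_score(3, 4))      # 75.0
```

Note that the two scales run in opposite directions: a lower implicit score indicates better care, whereas a higher explicit score indicates better process.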

Explicit  process  scale  scores  were  consistently  lower  (worse)  for  the  group  of 
patients  who  had  received  poor  or  very  poor  care  as  judged  by  implicit  review.  Results 
of  comparisons  with  three  of  the  explicit  scales  are  presented  in  Table  9.  "MD  Process 
Day  1  and  Day  2"  measures  whether  any  physician  recorded  specified  information  about 
the  history  and  physical  examination  on  day  1  or  2  of  the  hospitalization.  "MD  Process 
Day  3"  asks  whether  any  physician  recorded  such  information  on  day  3  of  the 
hospitalization.  "Abnormal  Lab  Rechecked"  asks  whether  highly  abnormal  laboratory 
results,  such  as  a  potassium  <  3.0  mmol/L,  were  ever  rechecked  during  the 
hospitalization. 

Table 9

Comparison Between Implicit and Explicit Process
Scale Scores for Five Diseases

                     Average Process on Explicit Scales (0-100)^a

                     MD Process       MD Process       Abnormal
Overall Implicit     Days 1, 2^b      Day 3^c,d        Lab Rechecked^e
Quality Scale        n = 1,849        n = 1,316        n = 909

Very good care           88               76               73
Good care                75               59               68
Poor care                69               48               57
Very poor care           63               47               55

a  A higher score indicates better performance for explicit review
scales.

b  Significance of differences for all pairwise comparisons is p <
0.001.

c  This scale was not measured for hip fracture.

d  Significance of differences for all pairwise comparisons is p <
0.001 except for "poor" versus "very poor," which is not significant.

e  Significant differences for "very good" versus "poor" or "very
poor," and for "good" versus "very poor" (p < 0.01).
For MD Process Day 1/Day 2, and Day 3, the "very good" and "good" care
groups  based  on  implicit  review  had  significantly  different  explicit  scores  from  each 
other  (p  <  0.001)  and  from  the  "poor"  and  "very  poor"  groups.  For  the  "Abnormal  Lab 
Rechecked"  scale,  "good"  and  "very  good"  could  not  be  distinguished  on  the  explicit 
scale  from  each  other,  but  scored  significantly  higher  than  did  "poor"  or  "very  poor" 
cases  (p  <  0.01).  The  "poor"  and  "very  poor"  groups,  on  the  other  hand,  could  not  be 
distinguished  from  each  other  based  on  explicit  process  scores.  These  results  held  when 
patients  discharged  alive  and  those  discharged  dead  were  analyzed  separately. 

Overall,  out  of  the  67  possible  comparisons  between  explicit  process  scores  and 
implicit  review  scores,  54  (81  percent)  had  higher  scores  for  cases  that  were  rated 
"good"  or  "very  good"  than  "poor"  or  "very  poor,"  nine  (13  percent)  of  the  scales  had 
equivalent  scores  for  the  two  groups,  and  the  remaining  four  (6  percent)  had  lower 
(worse)  explicit  process  scores  among  the  cases  rated  better  on  implicit  review.  This 
pattern  of  results  is  highly  significant  (p  <  0.001). 
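
The percentages above follow directly from the counts, and the strength of the pattern can be illustrated with a simple sign test. The text does not say which test produced the reported p value, so the two-sided binomial sign test below (ties dropped) is an assumption made for illustration only, and `sign_test_two_sided` is a hypothetical helper.

```python
from math import comb


def sign_test_two_sided(wins: int, losses: int) -> float:
    """Two-sided binomial sign test p value, with ties excluded by the caller."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of k or more successes out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


higher, equal, lower = 54, 9, 4          # counts reported in the text
total = higher + equal + lower           # 67 comparisons
shares = [round(100 * x / total) for x in (higher, equal, lower)]
p_value = sign_test_two_sided(higher, lower)

print(total, shares)      # 67 [81, 13, 6]
print(p_value < 0.001)    # True
```

Under this assumed test, 54 comparisons in the expected direction against 4 in the opposite direction gives a p value far below 0.001, in line with the significance level reported.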

NUMBER  OF  REVIEWERS  NEEDED  TO  IDENTIFY  POOR  CARE 

To determine how hospitals and review organizations could use our results to
identify poor quality care, we used our reliability, validity, and prevalence
results to calculate the number of reviewers necessary to answer two questions that are
important for evaluating the results of a medical record review. The first question is:
How many reviewers are needed to be 80, 85, or 95 percent certain that care that
reviewers call "very poor" is, in reality, either "poor" or "very poor"? In Table 10, we
see that, at 95 percent certainty, the number of reviewers needed reaches a maximum of
four to six, depending on the disease. This means that if the mean quality score for a
case, calculated from the ratings of six or more reviewers, falls within the "very poor"
category, we can be 95 percent certain that the care given is either "poor" or "very poor."

The second question is: How many reviewers are needed to be 80, 85, or 95
percent certain that care that is judged by reviewers as "good" care is, in reality, not
"very poor" care? Our results show that this judgment can be made using only one
reviewer with 98 percent accuracy.
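
One ingredient of such a calculation is the way reliability improves when the ratings of several reviewers are averaged. The sketch below uses the standard Spearman-Brown formula to illustrate that effect. It is not the calculation behind Table 10, which also draws on validity and prevalence results and is not reproduced here; the single-reviewer reliability of 0.5 is a hypothetical value chosen only for the example.

```python
def spearman_brown(single_rater_reliability: float, k: int) -> float:
    """Reliability of the mean of k raters, given the reliability of one rater."""
    r = single_rater_reliability
    return k * r / (1 + (k - 1) * r)


def reviewers_needed(single_rater_reliability: float, target: float, max_k: int = 50):
    """Smallest number of reviewers whose mean rating reaches the target reliability."""
    for k in range(1, max_k + 1):
        if spearman_brown(single_rater_reliability, k) >= target:
            return k
    return None  # target not reachable within max_k reviewers


# Hypothetical single-reviewer reliability; not a figure taken from this report.
r1 = 0.5
for k in (1, 2, 3):
    print(k, round(spearman_brown(r1, k), 3))   # 0.5, 0.667, 0.75
print(reviewers_needed(r1, 0.7))                # 3
```

The gain from each additional reviewer shrinks as the panel grows, which is consistent with the small numbers of reviewers shown in Table 10.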
Table 10

Number of Reviewers Needed to Identify Very Poor
Care or Good Care

                                        % Certainty Desired
To be sure that:                        80       85       95

"Very poor care"           CHF           3        4        6
is poor or very poor       PNE           2        2        4
                           AMI           2        3        5
                           CVA           2        2        4
                           HIP           4        5

"Good or very good         CHF
care" is not very poor     PNE
                           AMI
                           CVA
                           HIP
4.  RESULTS PART II: CHANGES IN QUALITY OF CARE
BETWEEN 1981-1982 AND 1985-1986
(BEFORE AND AFTER MEDICARE PROSPECTIVE PAYMENT)


Reviewers  judged  the  quality  of  care  for  the  five  study  diseases  to  be  generally 
good  (Table  11).  Many  more  cases  (82  percent,  for  all  five  diseases  combined)  were 
considered  to  be  good  or  very  good  than  were  judged  to  be  poor  or  very  poor  (18 
percent). For congestive heart failure, however, 27 percent of the patients were judged
to have received poor or very poor care; for cerebrovascular accident, 24 percent;
for acute myocardial infarction, 23 percent; and for pneumonia, 13 percent.
Patients with hip fracture were least often judged to receive poor or very poor care (4
percent).

Table  12  shows  how  the  proportion  of  patients  judged  as  receiving  poor  or  very 
poor  care  changed  over  time.  For  all  diseases,  reviewers  judged  the  cases  from  the 
1985-1986  time  period  more  favorably  than  those  from  the  1981-1982  time  period.  This 
difference was statistically significant at the p < 0.01 level for acute myocardial
infarction and cerebrovascular accident and at the p < 0.05 level for congestive heart
failure. The difference was significant at

Table 11

The Quality of Medical Care in the United States for Five Diseases
as Measured by Implicit Review^a

                                            Percentage Distribution of Patients on the
                                            Overall Quality of Care Scale

                               (No. of                                           Poor or
Disease                        Patients)    Very Good   Good   Poor   Very Poor  Very Poor

Congestive heart failure          278           4        68     19        8         27
Acute myocardial infarction       275          10        69     15        7         23
Pneumonia                         273           8        79     11        2         13
Cerebrovascular accident          270          12        64     15        9         24
Hip fracture                      270           3        93      4        0          4
All diseases combined            1366           7        75     13        5         18

a  Reviewer effects are removed (see the text).
Table 12

Percentage of Patients with Poor or Very Poor Care
by Study Period for Five Diseases^a

                                             Pre-PPS       Post-PPS
                               Total         (1981-82)     (1985-86)
                               No. of        (Total        (Total        Difference
Disease                        Patients      n = 669)      n = 697)      (Post-Pre)^b

Congestive heart failure          278           35            22            -13*
Acute myocardial infarction       275           33            13            -20**
Pneumonia                         273           17            10             -7+
Cerebrovascular accident          270           36            16            -20**
Hip fracture                      270            5             2             -3
All diseases combined            1366           25            12            -13***

a  Reviewer effects are removed.

b  Significance of pre-post PPS difference is indicated by the following symbols: + =
p < 0.10, * = p < 0.05, ** = p < 0.01, *** = p < 0.001. The pattern of significance
takes into account not only the size of the pre-post PPS difference but also differences
in reliabilities by disease.


the p < 0.06 level for pneumonia and was not significant for hip fracture.
Virtually  all  scales  and  items  on  the  form  were  rated  higher  in  the  post-PPS,  1985-1986 
period,  and  most  pre-post-PPS  differences  were  significant  at  the  p  <  0.001  level  when 
we  aggregated  across  the  five  diseases. 

Table  13  summarizes  changes  over  time  in  both  the  appropriateness  of  length  of 
stay  and  in  instability  at  discharge  for  the  five  study  diseases.  Physicians  judged  lengths 
of  stay  to  be  more  appropriate  in  the  1985-1986  period  than  in  the  1981-1982  period  for 
three  diseases  (acute  myocardial  infarction,  pneumonia,  and  cerebrovascular  accident), 
but  slightly  less  appropriate  for  two  diseases  (congestive  heart  failure  and  hip  fracture). 
The  distribution  of  inappropriateness  between  stays  that  were  inappropriately  too  long 
and  inappropriately  too  short  also  changed,  with  fewer  of  the  inappropriate  stays  during 
the  1985-1986  period  being  due  to  inappropriately  long  stays  and  more  being  due  to 
inappropriately  short  stays.  This  was  particularly  true  of  congestive  heart  failure,  where 
32 percent of patients were judged to have been discharged too soon during the later
period  compared  to  24  percent  in  the  earlier  period,  and  for  hip  fracture,  where  23 
percent  of  patients  were  judged  to  have  been  discharged  too  soon  in  the  later  period, 
compared  to  12  percent  in  the  earlier  period.  For  these  two  diseases,  in  the  post-PPS 
period,  instability  at  discharge  was  judged  to  be  present  in  nearly  half  of  the  patients 
Table 13

Changes in the Percentage of Patients Judged by Reviewers to Have Inappropriate
Length of Stay for Five Diseases: Pre-PPS (1981-82) Versus Post-PPS (1985-86)^a

                                         Probably or
                                         Definitely                                        Too Short and
                                         Inappropriate                                     Unstable at
                                         Length of Stay    Too Long        Too Short       Discharge^b
                                         Pre     Post      Pre     Post    Pre     Post    Pre     Post
Disease                                  (%)     (%)       (%)     (%)     (%)     (%)     (%)     (%)

Congestive heart failure (n = 247)       34      35        10       3*     24      32      12.5    17.3
Acute myocardial infarction (n = 212)    30      11*       16       3*     14       8       2.1     0.0
Pneumonia (n = 235)                      16      12        10       5*      6       7       0.6     5.1*
Cerebrovascular accident (n = 216)       42      25*       32      17*     10       8       0.6     1.6
Hip fracture (n = 258)                   27      28        15       5*     12      23*      3.6     9.4
All diseases combined (n = 1168)         30      23*       17       7*     13      17       4.0     7.1*

*  p < 0.05 for pre-post PPS comparison.

a  Includes only patients discharged alive.

b  Patients judged unstable at discharge are a subset of all patients with too short lengths of stay. The
percentages given indicate the proportion of the entire sample with the disease who were both discharged
unstable and had a too-short stay.


discharged  too  soon.  Overall,  the  fraction  of  hospitalizations  rated  as  too  short  and 
unstable  at  discharge  almost  doubled  from  pre-  to  post-PPS  (4.0  percent  to  7.1  percent, 
p  <  0.05). 
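
The internal arithmetic of Table 13 can be checked with a short sketch: the "too long" and "too short" percentages should add up to the "inappropriate" percentage, and the share of too-short stays that were also unstable is the ratio of the last two column groups. The dictionary layout and function names below are hypothetical; the figures are the rounded values printed in the table, so sums are compared with a tolerance of one percentage point.

```python
# (pre, post) percentages transcribed from Table 13.
table13 = {
    "CHF": {"inappropriate": (34, 35), "too_long": (10, 3), "too_short": (24, 32), "short_unstable": (12.5, 17.3)},
    "AMI": {"inappropriate": (30, 11), "too_long": (16, 3), "too_short": (14, 8), "short_unstable": (2.1, 0.0)},
    "PNE": {"inappropriate": (16, 12), "too_long": (10, 5), "too_short": (6, 7), "short_unstable": (0.6, 5.1)},
    "CVA": {"inappropriate": (42, 25), "too_long": (32, 17), "too_short": (10, 8), "short_unstable": (0.6, 1.6)},
    "HIP": {"inappropriate": (27, 28), "too_long": (15, 5), "too_short": (12, 23), "short_unstable": (3.6, 9.4)},
    "ALL": {"inappropriate": (30, 23), "too_long": (17, 7), "too_short": (13, 17), "short_unstable": (4.0, 7.1)},
}
PERIOD = {"pre": 0, "post": 1}


def sum_discrepancy(disease: str, period: str) -> float:
    """Absolute gap between (too long + too short) and the inappropriate total."""
    row, i = table13[disease], PERIOD[period]
    return abs(row["too_long"][i] + row["too_short"][i] - row["inappropriate"][i])


def unstable_share_of_too_short(disease: str, period: str) -> float:
    """Fraction of too-short stays that were also judged unstable at discharge."""
    row, i = table13[disease], PERIOD[period]
    return row["short_unstable"][i] / row["too_short"][i]


max_gap = max(sum_discrepancy(d, p) for d in table13 for p in PERIOD)
print(max_gap)                                                # 1
print(round(unstable_share_of_too_short("CHF", "post"), 2))   # 0.54
print(round(unstable_share_of_too_short("HIP", "post"), 2))   # 0.41
```

The largest gap is one percentage point (all diseases combined, post-PPS), which is attributable to rounding. In the post-PPS period, instability was present in roughly 54 percent of too-short congestive heart failure stays and 41 percent of too-short hip fracture stays.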
5.  DISCUSSION


Over  the  past  20  years,  quality  of  care  has  been  measured  by  both  implicit  and 
explicit  measures  (Donabedian,  1980, 1982).  Implicit  review  of  the  medical  record, 
however,  has  become  the  community  gold  standard  for  making  final  quality  judgments 
(Dippe  et  al.,  1989).  Research  has  demonstrated  that  this  approach  to  quality  of  care 
assessment  has  value  (Peters,  1972;  Dans  et  al.,  1985),  but  few  studies  have  rigorously 
measured  the  performance  of  the  implicit  review  technique  (Richardson,  1972;  Brook, 
1973;  Dubois  et  al.,  1987b).  We  have  retooled  this  classical  method  for  performing 
quality  review  and  have  found  that,  in  terms  of  reliability,  our  structured  approach  to 
implicit  review  performs  as  well  as  many  other  clinical  tests  (Koran,  1975a,  1975b). 

We  have  also,  for  the  first  time,  demonstrated  a  process-outcome  link  for  implicit 
review.  We  tested  the  validity  of  implicit  review  by  determining  whether  poor  quality  of 
care  detected  by  this  method  led  to  worse  outcomes  (inhospital  or  posthospital  death). 
We  found  convincing  evidence  that  poor  care  is  associated  with  an  increased  frequency 
of  bad  outcomes,  particularly  if  the  care  is  in  the  "very  poor"  range.  This  finding  is 
consistent  with  the  "threshold  effect"  observed  in  prior  studies  (Rubenstein  et  al.,  1977). 

It  is  possible  that  reviewer  knowledge  of  deaths  in  the  hospital  influenced  their 
ratings  of  quality.  However,  our  training  emphasized  avoidance  of  post-hoc  attribution 
of blame. We trained our reviewers to think in epidemiologic terms: what would the
likelihood of a poor outcome have been in 100 patients treated in this way? Reviewers
were  to  judge  good  outcomes  just  as  harshly  as  poor  outcomes  if  the  treatment  would 
have  been  likely  to  produce  negative  outcomes  in  a  significant  number  of  people  treated 
in  that  way.  Of  our  12  hours  of  training,  perhaps  the  majority  of  the  time  was  spent  on 
this  subject. 

Evidence  of  our  success  comes  from  the  fact  that  the  Overall  Quality  scores  for 
people  who  died  were  only  slightly  lower  than  those  for  people  who  survived  (reflecting 
the  fact  that  many  people  who  died  were  assigned  good  process  scores  and  many  people 
who  lived  received  poor  process  scores).  We  found  no  evidence  that  reviewers 
systematically  downrated  quality  for  patients  who  died  in  the  hospital.  Stratifying  by  day 
of  death,  inhospital  and  out-of-hospital  deaths  had  equivalent  quality  scores.  In  addition, 
we  regressed  inhospital  death  on  implicit  process  and  postdischarge  death  on  implicit 
process.  When  the  small  sample  size  for  out-of-hospital  deaths  is  taken  into 
consideration,  the  results  from  both  analyses  are  similar. 
We compared implicit and explicit review to determine whether these two
methods  measured  the  same  quality  of  care  constructs,  or  whether  they  measured  entirely 
different  aspects  of  the  quality  of  medical  care.  Happily,  the  two  methods  accorded  well. 
Interestingly,  the  relationship  between  implicit  process  and  mortality  is  strongest  when 
patients  received  very  poor  care,  whereas  the  implicit  review/explicit  review  comparison 
is  strongest  when  comparing  good  or  very  good  care  with  poor  or  very  poor  care. 

Because  the  finding  that  implicit  and  explicit  scores  related  to  each  other  as  well 
as  they  did  was  quite  a  surprise  to  us,  we  were  not  disappointed  that  they  were  not  more 
closely  matched  and  did  not  relate  in  quite  the  same  way  to  outcome.  Conceptually,  the 
implicit  Overall  Quality  score  is  an  amalgam  of  all  the  explicit  quality  scores.  We  would 
not  have  expected  it  necessarily  to  closely  reflect  an  individual  explicit  score.  In  some 
cases,  for  example,  a  poor  Overall  Quality  score  might  be  due  to  poor  hospital  services; 
in  others,  the  deficit  may  be  only  in  physician  care.  The  Overall  Quality  score  on  the 
implicit  review  would  reflect  some  combination  of  these  judgments.  The  fact  that  the 
Overall  Quality  judgment  on  implicit  review  accorded  with  so  many  explicit  review 
scales  may  either  mean  that  the  physicians  were  very  comprehensive  in  their  reviews,  or 
that  poor  quality  in  one  area  correlates  very  highly  with  poor  care  in  other  areas.  The 
fact  that  individual  explicit  scales  did  not  distinguish  poor  and  very  poor  care  as  well  as 
the  process/outcome  link  did  may  be  because  the  very  poor  care  patients  on  the  Overall 
Quality  scale  are  those  who  received  low  quality  care  in  all  areas,  although  any 
individual  area  alone  was  not  always  in  the  very  poor  range. 

We  obtained  our  reliabilities  and  process-outcome  link  while  examining  a 
nationally  representative,  and  therefore  quite  diverse,  sample  of  patients  with  five 
diseases,  using  a  geographically  diverse  group  of  physician  reviewers  (Draper  et  al., 
1990),  both  of  which  might  have  been  expected  to  lower  the  level  of  agreement  among 
raters  (Strumwasser  et  al.,  1990;  Wennberg  and  Gittelsohn,  1973).  Reliabilities  were 
sufficiently  high  to  allow  for  meaningful  comparisons  between  groups  of  patients, 
although  a  single  review  would  not  be  reliable  enough  to  accurately  judge  whether  an 
individual  patient  received  poor  care. 

SIR  could  be  an  excellent  tool  for  sampling  a  wide  range  of  cases  at  a  given 
hospital,  to  identify  instances  of  poor  quality  care.  A  single  reviewer  could  screen 
records  and  accurately  separate  out  the  "good  care"  cases.  Only  one  to  three  additional 
reviewers,  depending  upon  the  disease  studied,  would  be  needed  to  confirm  that  cases 
screened as "poor" actually were poor with a high degree of certainty (80 percent).
When used to compare groups of patients within a hospital, rather than to judge an
individual case, 95 percent accuracy would be unnecessary. The goal of the review
would be the positive one of improving care for patients at the hospital in question, rather
than identifying and censuring single cases of poor care. Our form could be used in this
way to screen for cases requiring a more detailed review that would determine the exact
clinical etiologies of the problems encountered.

As  has  been  found  in  previous  studies,  our  physician  reviewers  differed  somewhat 
in  their  degree  of  harshness  or  leniency.  We  believe  that  our  adjustment  for  reviewer 
characteristics  in  this  regard  may  be  a  practical  method  for  standardizing  the  reviewer 
rating  scale  and  producing  increased  accuracy.  To  acquire  these  data,  review 
organizations  would  merely  need  to  store  ratings  over  a  period  of  time,  perhaps  one 
month,  determine  the  reviewer's  mean  rating,  and  adjust  the  reviewer's  mean  to  the 
mean  of  the  larger  group  of  reviewers.  Each  subsequent  rating  by  the  reviewer  could 
then  be  adjusted  up  or  down  appropriately.  Further  studies  are  needed  to  determine 
whether  these  results  are  generalizable. 
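
The adjustment procedure described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the adjustment used in the study: the group mean is taken here as the mean of all stored ratings pooled across reviewers, the reviewer names and ratings are invented, and the function names are hypothetical.

```python
from statistics import mean


def reviewer_offsets(history: dict) -> dict:
    """Offset that moves each reviewer's mean rating onto the pooled group mean.

    history maps a reviewer identifier to the ratings stored over the
    calibration period (for example, one month).
    """
    pooled = [rating for ratings in history.values() for rating in ratings]
    group_mean = mean(pooled)
    return {rev: group_mean - mean(ratings) for rev, ratings in history.items()}


def adjust_rating(rating: float, reviewer: str, offsets: dict) -> float:
    """Shift a subsequent rating up or down by the reviewer's stored offset."""
    return rating + offsets[reviewer]


# Invented calibration data: reviewer_a rates lower on average than reviewer_b.
history = {
    "reviewer_a": [3.0, 4.0, 5.0],
    "reviewer_b": [5.0, 6.0, 7.0],
}
offsets = reviewer_offsets(history)
print(offsets)                                    # {'reviewer_a': 1.0, 'reviewer_b': -1.0}
print(adjust_rating(6.0, "reviewer_b", offsets))  # 5.0
```

A simple additive shift corrects only for overall harshness or leniency. It does not address reviewers who differ in how widely they spread their ratings.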

We  hope  that  by  rescrutinizing  and  improving  implicit  review  methodology,  we 
will  encourage  a  reevaluation  and  further  development  of  this  valuable  method  of  quality 
assessment.  We  believe  that  this  will  contribute  to  the  ability  of  hospitals  and  peer 
review  organizations  to  monitor  and  improve  the  quality  of  medical  care. 

When  we  used  implicit  review  to  judge  the  quality  of  medical  care  delivered  to  a 
nationally  representative  sample  of  patients  with  our  five  study  diseases,  we  found  that 
although  most  care  is  judged  to  be  good  or  very  good,  a  significant  proportion  of  care 
was judged to be poor or very poor. Despite, or because of, the institution of prospective
payment and implicit review of medical records by PROs, the proportion of care rated
poor or very poor decreased between 1981-1982 and 1985-1986 from about one-quarter
of patients to about one-eighth of patients. This finding confirms that of
Kahn et al. (1990a), who obtained similar results using explicit review of the medical
record. Of concern, however, is the increasing number of patients judged to have been
discharged too soon and, in particular, with significant instability at discharge. Kosecoff
et al. (1990) reported a similar increase in patients discharged unstable using explicit
review criteria.
In summary, SIR performed well as a method for measuring the quality of medical
care  delivered  to  Medicare  patients.  We  were  able  to  document  an  improvement  in  care 
following  implementation  of  the  prospective  payment  system,  tempered  by  an  increase  in 
the  number  of  patients  discharged  unstable.  Both  of  these  findings  mirrored  the  findings 
of  our  explicit  review.  On  the  other  hand,  our  results  indicated  that  the  reliability  of  the 
implicit  review  method  is  low  enough  that  multiple  reviewers  are  necessary  to  identify 
accurately  an  individual  case  of  very  poor  care.  This  finding  adds  a  cautionary  note  for 
those  using  implicit  review  to  identify  poor  performance  in  a  single  case. 




Appendix  A 


STRUCTURED  IMPLICIT  REVIEW  QUESTIONNAIRE  FOR  CONGESTIVE 
HEART  FAILURE,  ACUTE  MYOCARDIAL  INFARCTION  AND  PNEUMONIA 


Case ID: AMI


Review  Date 


Month 


Day 


Year 


4-5/ 
9-lCj 


1)    Please  rate  the  quality  of  physician  and  nurse  documentation  of  each  of  the  following: 
patient's  prior  and  chronic  disease,  functional  status,  habits,  and  psychosocial  status 
prior  to  the  current  acute  illness. 

excellent    good    adequate    poor    very poor
(1)          (2)     (3)         (4)     (5)

a)  physician  documentation 

of  prior  and  chronic  disease            15/ 

b)  physician  documentation 

of  functional  status  (e.g., 

ambulation)            16/ 

c)  physician  documentation  of 
habits  (e.g.,  alcohol, 

smoking,  diet)            17/ 

d)  physician  documentation 
of  psychosocial  status 
(e.g.,  dementia,  depression, 

nursing  home  residence)            18/ 

e)  nurse  documentation  of  prior 
and  chronic  disease, 
functional  status,  habits, 

and  psychosocial  status            19/ 


f)  Check here if the record demonstrates evidence that the physician has
ready access to additional records that supplement the current data
regarding the patient's prior condition.  [ ]  20/



2)    Please rate the physician's initial assessment of acute medical problems present at
admission.    Base  your  answer  on  the  history,  physical,  and  labs. 

excellent    good    adequate    poor    very poor
(1)          (2)     (3)         (4)     (5)

a)  completeness  of  initial  data 

gathering            21/

b)  integration  of  admission 
information  and  development 

of  appropriate  diagnoses            22/ 


c)  initial  treatment  plan  and 
initial  orders 


23/ 
4)  Was the choice of surgical or nonsurgical treatment appropriate?  Was
the  type  of  operation  performed  appropriate?    Was  the  patient 
adequately  stabilized  prior  to  surgery?    Was  the  technical  quality 

of  the  operation  adequate? 

definitely    probably    probably    definitely  not 
yes  yes  no  no  applicable 

a)  appropriate  (1)  (2)  (3)  (4)  (5) 
treatment  choice 

(surgical  vs 

nonsurgical)          46/ 

b)  appropriate  type 

of  operation            47/ 

c)  adequate 
stabilization 

prior  to  surgery            48/ 

d)  adequate 
technical 
quality  of 

the  operation            49/ 

ANSWER  QUESTION  5  ONLY  IF  THE  PATIENT  WAS  DISCHARGED  ALIVE 

5)  Was  length  of  stay  appropriate  given  the  patient's  status  at  discharge 
and  disposition  plans? 

definitely  yes    (1)  50/ 

probably  yes    (2) 

probably  no    (3) 

definitely  no   (4) 

5a)  If  probably  or  definitely  not  appropriate,  how  would  you  describe 
length  of  stay? 

too  short    (1)  51/ 

too  long    (2) 
5b)  If length of stay was too short or too long, what were the apparent
reasons?    Check  one  or  more  reasons  if  applicable. 


Too  short 

i)  Patient  too  unstable 

ii)  Work-up  incomplete 

iii)  Rehabilitation  incomplete 


iv)  Patient  or  family  insisted 
on  discharge 


Too  long 


v)  Waiting  for  nursing 
home  or  ECF  bed 

vi)  Waiting  for  home 

care  support  service 

vii)  Patient  or  family 

refused  discharge 

viii)  Waiting  for  procedure 




ANSWER  QUESTION  6  ONLY  IF  THE  PATIENT  DIED  DURING  THE  HOSPITALIZATION 
6)    Was  the  patient's  death  preventable? 


definitely preventable        (1)
probably preventable          (2)
probably not preventable      (3)
definitely not preventable    (4)


52-53/ 
54-55/
56-57/ 
58-59/ 


60/ 


ANSWER  THE  REMAINING  QUESTIONS  FOR  ALL  PATIENTS 

7)    How would you characterize the patient's outcome at discharge?


much  better  than  expected 
better  than  expected 
as  expected 
worse  than  expected 
much  worse  than  expected 


(1) 
(2) 
(3) 
(4) 
(5) 


61/ 


8)    Considering  everything  you  know  about  this  patient,  please  rate  overall
quality  of  care.

extreme,  above  standard    (1)  62/

above  standard    (2)

adequate    (3)

below  standard    (4)

extreme,  below  standard    (5)

9)  Would  you  send  your  mother  to  these  physicians  in  this  hospital?

definitely  yes    (1)  63/

probably  yes    (2)

probably  no    (3)

definitely  no    (4)
3)    Assume  adequate  to  optimal  care  and  assume  the  patient  recovers  from  this
episode  of  acute  myocardial  infarction.    What  do  you  believe  is  this  patient's  life
expectancy?

<  1  month    (1)  24/

1-6  months    (2)

>  6  months - 1  year    (3)

>  1  year    (4)


4)  Assume  adequate  to  optimal  care  by  the  physician  and  hospital.  How 
effective  is  medical  science  in  treating  this  patient's  acute  illness  or 
in  preventing  worsened  health  status  due  to  the  illness?    Consider  the 
severity  of  the  patient's  acute  illness  and  the  patient's  chronic  reserve. 

very  effective    (1)  25/ 

effective    (2) 

not  so  effective    (3) 

very  ineffective    (4) 

5)  Considering  the  entire  hospitalization,  on  average,  was  use  of  these 
services  appropriate  with  respect  to  the  patient's  needs?    If  not 
appropriate,  was  it  because  of  underuse? 

                                   definitely  probably  probably  definitely  under-
                                   yes         yes       no        no          use
                                   (1)         (2)       (3)       (4)

a)    monitoring intensity                                                             26/

ai)   intensive care                                                                   27/

aii)  telemetry without
      intensive care                                                                   28/

b)    respiratory therapy
      delivered                                                                        29-30/

c)    O2 and ventilation                                                               31-32/

d)    arterial blood gases                                                             33-34/

e)    physical therapy
      delivered                                                                        35-36/

f)    EKGs                                                                             37-38/

g)    chest x-rays                                                                     39-40/

h)    venous blood
      tests, urinalyses,
      sputum analyses                                                                  41-42/

i)    consultations                                                                    43-44/

j)    medications (type
      and route)                                                                       45-46/
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ANSWER  QUESTION  6  ONLY  IF  THE  PATIENT  WAS  DISCHARGED  ALIVE 

6)    Was  length  of  stay  appropriate  given  the  patient's  status  at  discharge 
and  disposition  plans? 

definitely  yes    (1)  47/ 

probably  yes    (2) 

probably  no    (3) 

definitely  no    (4) 

6a)  If  probably  or  definitely  not  appropriate,  how  would  you  describe 
length  of  stay? 

too  short    (1)  48/

too  long    (2) 


6b)  If  length  of  stay  was  too  short  or  too  long,  what  were  the  apparent 
reasons?    Check  one  or  more  reasons  if  applicable. 

Too short                               Too long

i)    Patient too unstable              v)     Waiting for nursing
                                               home or ECF bed              49-50/

ii)   Work-up incomplete                vi)    Waiting for home
                                               care support service         51-52/

iii)  Rehabilitation incomplete         vii)   Patient or family
                                               refused discharge            53-54/

iv)   Patient or family insisted        viii)  Waiting for procedure        55-56/
      on discharge

ANSWER  QUESTION  7  ONLY  IF  THE  PATIENT  DIED  DURING  THE  HOSPITALIZATION 

7)    Was  the  patient's  death  preventable? 

definitely  preventable    (1)  57/ 

probably  preventable    (2) 

probably  not  preventable    (3) 

definitely  not  preventable    (4) 
ANSWER  THE  REMAINING  QUESTIONS  FOR  ALL  PATIENTS

8)    How  would  you  characterize  the  patient's  outcome  at  discharge? 

much  better  than  expected    (1)  58/

better  than  expected    (2) 

as  expected    (3) 

worse  than  expected    (4) 

much  worse  than  expected    (5) 


9)  Considering  everything  you  know  about  this  patient,  please  rate  overall 
quality  of  care. 

extreme,  above  standard    (1)  59/ 

above  standard    (2) 

adequate    (3) 

below  standard    (4) 

extreme,  below  standard    (5) 

10)  Would  you  send  your  mother  to  these  physicians  in  this  hospital? 

definitely  yes    (1)  60/ 

probably  yes    (2) 

probably  no    (3) 

definitely  no    (4) 
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Appendix  B 

STRUCTURED  IMPLICIT  REVIEW  QUESTIONNAIRE  FOR 
CEREBROVASCULAR  ACCIDENT 


Case  ID:  CVA

Review  Date  (Month,  Day,  Year)                                          9-14/


1)    Please  rate  the  quality  of  physician  documentation  of  each  of  the  following:
patient's  prior  and  chronic  disease  and  functional  status  prior  to  the  current
acute  illness.

                                   excellent  good  adequate  poor  very poor
                                   (1)        (2)   (3)       (4)   (5)

a)  physician documentation
    of prior and chronic disease                                              15/

b)  physician documentation
    of functional status (e.g.,
    ambulation)                                                               16/

c)  Check  here  if  the  record  demonstrates  evidence  that  the  physician  has
ready  access  to  additional  records  that  supplement  the  current  data
regarding  the  patient's  prior  condition.                                  17/


2)    Please  rate  the  physician  initial  assessment  of  acute  medical  problems  present  at
admission.    Base  your  answer  on  the  history,  physical,  and  labs.

                                   excellent  good  adequate  poor  very poor
                                   (1)        (2)   (3)       (4)   (5)

a)  completeness of initial
    data gathering                                                            18/

b)  integration of admission
    information and development
    of appropriate diagnoses                                                  19/

c)  initial treatment plan and
    initial orders                                                            20/


3)    Assume  adequate  to  optimal  care  and  assume  the  patient  recovers  from  this
CVA.    What  do  you  believe  is  this  patient's  life  expectancy?

<  1  month    (1)  21/

1-6  months    (2)

>  6  months - 1  year    (3)

>  1  year    (4)
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4)     Assume  adequate  to  optimal  care  by  the  physician  and  hospital.  How 

effective  is  medical  science  in  treating  this  patient's  acute  illness  or 
in  preventing  worsened  health  status  due  to  the  illness?    Consider  the 
severity  of  the  patient's  acute  illness  and  the  patient's  chronic  reserve. 


very  effective    (1)  22/ 

effective    (2) 

not  so  effective    (3) 

very  ineffective    (4) 

5)    Considering  the  entire  hospitalization,  on  average,  was  use  of  these 
services  appropriate  with  respect  to  the  patient's  needs?    If  not 
appropriate,  was  it  because  of  underuse? 

                                   definitely  probably  probably  definitely  under-
                                   yes         yes       no        no          use
                                   (1)         (2)       (3)       (4)

a)    neurologic examination                                                           23-24/

b)    monitoring intensity                                                             25/

bi)   intensive care                                                                   26/

bii)  telemetry without
      intensive care                                                                   27/

c)    respiratory therapy
      delivered                                                                        28-29/

d)    O2 and ventilation                                                               30-31/

e)    arterial blood gases                                                             32-33/

f)    physical or
      occupational
      therapy delivered                                                                34-35/

g)    speech therapy
      delivered                                                                        36-37/

h)    EKGs                                                                             38-39/

i)    chest x-rays                                                                     40-41/

j)    venous blood
      tests, urinalyses,
      sputum analyses                                                                  42-43/

k)    consultations                                                                    44-45/

l)    angiography                                                                      46-47/

m)    non-invasive tests                                                               48-49/

n)    CT/NMR                                                                           50-51/

o)    medications
      (type and route)                                                                 52-53/
ANSWER  QUESTIONS  6  AND  7  ONLY  IF  THE  PATIENT  WAS  DISCHARGED  ALIVE

6)  Was  length  of  stay  appropriate  given  the  patient's  status  at  discharge 
and  disposition  plans? 

definitely  yes    (1)  54/ 

probably  yes    (2) 

probably  no    (3) 

definitely  no    (4) 

6a)  If  probably  or  definitely  not  appropriate,  how  would  you  describe 
length  of  stay? 

too  short    (1)  55/ 

too  long    (2) 

6b)  If  length  of  stay  was  too  short  or  too  long,  what  were  the  apparent 
reasons?    Check  one  or  more  reasons  if  applicable. 

Too short                               Too long

i)    Patient too unstable              v)     Waiting for nursing
                                               home or ECF bed              56-57/

ii)   Work-up incomplete                vi)    Waiting for home
                                               care support service         58-59/

iii)  Rehabilitation incomplete         vii)   Patient or family
                                               refused discharge            60-61/

iv)   Patient or family insisted        viii)  Waiting for procedure        62-63/
      on discharge

7)  How  would  you  characterize  the  patient's  functioning  at  discharge? 

not  disabled    (1)  64/ 

somewhat  disabled    (2) 

very  disabled,  but  not  bedridden    (3) 

bedridden    (4) 

ANSWER  QUESTION  8  ONLY  IF  THE  PATIENT  DIED  DURING  THE  HOSPITALIZATION 

8)  Was  the  patient's  death  preventable? 

definitely  preventable    (1)  65/ 

probably  preventable    (2) 

probably  not  preventable    (3) 

definitely  not  preventable    (4) 
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ANSWER  THE  REMAINING  QUESTIONS  FOR  ALL  PATIENTS 

9)  How  would  you  characterize  the  patient's  outcome  at  discharge? 

much  better  than  expected    (1)  66/ 

better  than  expected    (2) 

as  expected    (3) 

worse  than  expected    (4) 

much  worse  than  expected    (5)

10)  Considering  everything  you  know  about  this  patient,  please  rate  overall 
quality  of  care. 

extreme,  above  standard    (1)  67/

above  standard    (2) 

adequate    (3)

below  standard    (4) 

extreme,  below  standard    (5) 

11)  Would  you  send  your  mother  to  these  physicians  in  this  hospital? 

definitely  yes    (1)  68/ 

probably  yes    (2) 

probably  no    (3) 

definitely  no    (4) 
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Appendix  C 

STRUCTURED  IMPLICIT  REVIEW  QUESTIONNAIRE 
FOR  HIP  FRACTURE 


Case  ID:  HIP                                                              4-8/

Review  Date  (Month,  Day,  Year)                                          9-14/


1)    Please  rate  the  quality  of  physician  and  nurse  documentation  of  each  of  the  following:
patient's  prior  and  chronic  disease,  functional  status,  habits,  and  psychosocial  status
prior  to  the  current  acute  illness.

                                        excellent  good  adequate  poor  very poor
                                        (1)        (2)   (3)       (4)   (5)

a)  physician documentation of prior
    and chronic disease (e.g.,
    osteoporosis, Paget's disease,
    heart disease)                                                                 15/

b)  physician documentation of
    functional status
    (e.g., ambulation)                                                             16/

c)  physician documentation of habits
    (e.g., alcohol, smoking, diet)                                                 17/

d)  physician documentation of
    psychosocial status (e.g.,
    dementia, depression,
    nursing home residence)                                                        18/

e)  nurse documentation of
    prior and chronic disease,
    functional status, habits,
    and psychosocial status                                                        19/

f)  Check  here  if  the  record  demonstrates  evidence  that  the  physician  has
ready  access  to  additional  records  that  supplement  the  current  data
regarding  the  patient's  prior  condition.                                       20/


2)    Please  rate  the  completeness  of  the  physician  initial  assessment  of  the  hip  fracture  and 
of  acute  medical  problems  present  at  admission.    Base  your  answer  on 
the  history,  physical,  and  labs. 


                                   excellent  good  adequate  poor  very poor
                                   (1)        (2)   (3)       (4)   (5)

a)  hip fracture                                                              21/

b)  acute medical problems                                                    22/
3)    Considering  the  entire  hospitalization,  on  average,  was  use  of  these
services  appropriate  with  respect  to  the  patient's  needs?    If  not
appropriate,  was  it  because  of  underuse?

                                   definitely  probably  probably  definitely  under-
                                   yes         yes       no        no          use
                                   (1)         (2)       (3)       (4)

a)  monitoring intensity                                                               23/